Expectation-Maximization Algorithm.
Petr Pošík
Czech Technical University in Prague, Faculty of Electrical Engineering, Dept. of Cybernetics

Outline:
- MLE: Likelihood, Incomplete data, General EM
- K-means: Algorithm, Illustration, EM view
- EM for Mixtures: General mixture, GMM, EM for GMM
- EM for HMM: HMM, HMM learning, Sufficient statistics, Baum-Welch
- Summary: Competencies
Maximum likelihood estimation

Likelihood maximization. Let's have a random variable X with probability distribution p_X(x | θ). This emphasizes that the distribution is parameterized by θ ∈ Θ, i.e. the distribution comes from a certain parametric family; Θ is the space of possible parameter values.

Learning task: assume the parameters θ are unknown, but we have an i.i.d. training dataset T = {x_1, ..., x_n} which can be used to estimate them.

The probability of observing the dataset T given some parameter values θ is

  p(T | θ) = ∏_{j=1}^n p_X(x_j | θ) =: L(θ; T).

This probability can be interpreted as the degree to which the model parameters θ conform to the data T. It is thus called the likelihood of the parameters θ w.r.t. the data T. The optimal θ is obtained by maximizing the likelihood:

  θ* = arg max_{θ ∈ Θ} L(θ; T) = arg max_{θ ∈ Θ} ∏_{j=1}^n p_X(x_j | θ).

Since arg max_x f(x) = arg max_x log f(x), we often maximize the log-likelihood l(θ; T) = log L(θ; T):

  θ* = arg max_{θ ∈ Θ} l(θ; T) = arg max_{θ ∈ Θ} log ∏_{j=1}^n p_X(x_j | θ) = arg max_{θ ∈ Θ} ∑_{j=1}^n log p_X(x_j | θ),

which is often easier than maximizing L.

P. Pošík © 2017, Artificial Intelligence

Incomplete data

Assume we cannot observe the objects completely: the r.v. X describes the observable part, the r.v. K describes the unobservable, hidden part. We assume there is an underlying distribution p_XK(x, k | θ) of objects (x, k).

Learning task: we want to estimate the model parameters θ, but the training set contains i.i.d. samples of the observable part only, i.e. T_X = {x_1, ..., x_n}. (Still, there also exists a hidden, unobservable dataset T_K = {k_1, ..., k_n}.)

If we had the complete data (T_X, T_K), we could directly optimize l(θ; T_X, T_K) = log p(T_X, T_K | θ). But we do not have access to T_K.

If we would like to maximize l(θ; T_X) = log p(T_X | θ) = log ∑_{T_K} p(T_X, T_K | θ), the summation inside log() results in complicated expressions, or we would have to use numerical methods.

Our state of knowledge about T_K is given by p(T_K | T_X, θ).
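To make the log-likelihood maximization concrete, here is a minimal sketch (my own illustration, not from the slides) for a univariate Gaussian with known variance; the dataset, grid, and seed are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=500)   # i.i.d. training set T

def log_likelihood(mu, sigma, x):
    # l(mu; T) = sum_j log p_X(x_j | mu, sigma) for a univariate Gaussian
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (x - mu)**2 / (2 * sigma**2))

# Maximize l over a grid of candidate means; for a Gaussian the MLE of mu
# is the sample mean, so the grid argmax should land next to it.
grid = np.linspace(0.0, 4.0, 401)
lls = [log_likelihood(m, 1.0, data) for m in grid]
mu_hat = grid[int(np.argmax(lls))]
```

In practice the maximizer is found in closed form or by numerical optimization; the grid search here only makes the "pick the θ that best conforms to T" idea visible.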
The complete-data likelihood L(θ; T_X, T_K) = p(T_X, T_K | θ) is a random variable, since T_K is unknown and random, but governed by the underlying distribution. Instead of optimizing it directly, consider its expected value under the posterior distribution over the latent variables (E-step), and then maximize this expectation (M-step).
Expectation-Maximization algorithm

The EM algorithm is a general method of finding the MLE of probability distribution parameters from a given dataset when the data is incomplete (hidden variables, or missing values). Hidden variables appear e.g. in mixture models and Hidden Markov models. It is a family of algorithms, or a recipe to derive an ML estimation algorithm for various kinds of probabilistic models.

1. Pretend that you know θ. (Use some initial guess θ^(0).) Set the iteration counter i = 1.
2. E-step: Use the current parameter values θ^(i-1) to find the posterior distribution of the latent variables, p(T_K | T_X, θ^(i-1)). Use this posterior to find the expectation of the complete-data log-likelihood evaluated for general parameter values θ:
     Q(θ, θ^(i-1)) = ∑_{T_K} p(T_K | T_X, θ^(i-1)) log p(T_X, T_K | θ).
3. M-step: Maximize this expectation, i.e. compute the updated estimate
     θ^(i) = arg max_{θ ∈ Θ} Q(θ, θ^(i-1)).
4. Check for convergence: finish, or advance the iteration counter, i = i + 1, and repeat from step 2.

EM algorithm features

Pros:
- Among the possible optimization methods, EM exploits the structure of the model.
- For p_{X|K} from an exponential family, the M-step can be done analytically and there is a unique optimizer.
- The expected value in the E-step can be expressed as a function of θ without solving it explicitly for each θ.
- p_X(T_X | θ^(i+1)) ≥ p_X(T_X | θ^(i)), i.e. the process finds a local optimum.
- Works well in practice.

Cons:
- Not guaranteed to find the globally optimal estimate.
- MLE can overfit; use MAP instead (EM can be used for that as well).
- Convergence may be slow.
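As a toy instance of this recipe (my own illustration, not from the slides): two coins with unknown biases, where each batch of 10 tosses is produced by one coin but we never observe which. The counts and the initial guess below are made up.

```python
import math
import numpy as np

heads = np.array([5, 9, 8, 4, 7])      # observed heads per batch of n tosses
n = 10
theta = np.array([0.6, 0.5])           # initial guess theta^(0) for the two biases

def binom_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

for _ in range(50):
    # E-step: posterior over the hidden coin for each batch (uniform prior)
    lik = np.array([[binom_pmf(h, n, p) for p in theta] for h in heads])
    post = lik / lik.sum(axis=1, keepdims=True)      # p(T_K | T_X, theta)
    # M-step: maximizing Q gives a responsibility-weighted heads fraction
    theta = (post * heads[:, None]).sum(axis=0) / (post.sum(axis=0) * n)
```

Each pass is exactly one E-step (posterior over the hidden coin) followed by one M-step (closed-form maximizer of the expected complete-data log-likelihood).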
K-means algorithm

Clustering is one of the tasks of unsupervised learning. The K-means algorithm for clustering [Mac67]: K is the a priori given number of clusters.

Algorithm:
1. Choose K centroids μ_k (in almost any way, but every cluster should have at least one example).
2. For all x, assign x to its closest μ_k.
3. Compute the new position of each centroid μ_k based on all examples x_i, i ∈ I_k, in cluster k.
4. If the positions of the centroids changed, repeat from step 2.

Algorithm features:
- The algorithm minimizes the intracluster variance
    J = ∑_{j=1}^K ∑_{i=1}^{n_j} ||x_{i,j} - c_j||^2,   (1)
  where n_j is the number of examples in cluster j, x_{i,j} is the i-th example of cluster j, and c_j is its centroid.
- The algorithm is fast, but each run can converge to a different local optimum of J.

[Mac67] J. B. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, volume 1, Berkeley, 1967. University of California Press.

Illustration: K-means clustering over successive iterations (sequence of figures).
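A minimal NumPy sketch of the four steps above (illustrative only; like the algorithm itself, it assumes every cluster keeps at least one example):

```python
import numpy as np

def kmeans(X, K, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]      # 1. initial centroids
    z = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        d = ((X[:, None, :] - mu[None, :, :])**2).sum(axis=2)
        z = d.argmin(axis=1)                               # 2. assign to closest centroid
        new_mu = np.array([X[z == k].mean(axis=0)          # 3. recompute centroids
                           for k in range(K)])
        if np.allclose(new_mu, mu):                        # 4. stop when nothing moved
            break
        mu = new_mu
    return mu, z

# Two well-separated blobs; K-means should recover their centers.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal((0, 0), 0.5, (50, 2)),
               rng.normal((10, 10), 0.5, (50, 2))])
mu, z = kmeans(X, K=2)
```

Re-running with different seeds illustrates the last feature: the result can be a different local optimum of J.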
K-means: EM view

Assume: An object can be in one of K states with equal probabilities, and all p_{X|K}(x | k) are isotropic Gaussians: p_{X|K}(x | k) = N(x | μ_k, σI).

Recognition (part of the E-step): The task is to decide the state k for each x, assuming all μ_k are known. The Bayesian strategy (which minimizes the probability of error) chooses the cluster whose center is closest to the observation x:

  q*(x) = arg min_{k ∈ K} (x - μ_k)^2.

If the μ_k, k ∈ K, are not known, it is a parametrized strategy q_Θ(x), where Θ = (μ_k)_{k=1}^K. Deciding the state k for each x assuming known μ_k is actually the computation of a degenerate probability distribution p(T_K | T_X, θ^(i-1)), i.e. the first part of the E-step.

Learning (the rest of the E-step and the M-step): Find the maximum-likelihood estimates of the μ_k based on the known pairs (x_1, k_1), ..., (x_l, k_l):

  μ_k = (1/|I_k|) ∑_{i ∈ I_k} x_i,

where I_k is the set of indices of the training examples (currently) belonging to state k. This completes the E-step and implements the M-step.
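The equivalence between the Bayesian decision under equal priors and isotropic Gaussians and the nearest-centroid rule can be checked numerically; a small sketch (my illustration, with arbitrary centers):

```python
import numpy as np

mu = np.array([[0.0, 0.0], [4.0, 4.0], [0.0, 4.0]])   # known cluster centers
sigma = 1.0
rng = np.random.default_rng(0)

for _ in range(100):
    x = rng.uniform(-2, 6, size=2)
    # log posterior of state k under equal priors and N(mu_k, sigma*I),
    # dropping all terms that do not depend on k
    log_post = -((x - mu)**2).sum(axis=1) / (2 * sigma**2)
    nearest = ((x - mu)**2).sum(axis=1).argmin()
    assert log_post.argmax() == nearest   # same decision for every x
```

The assertion holds for any x because the log posterior is a strictly decreasing function of the squared distance to μ_k.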
EM for Mixture Models

General mixture distributions

Assume the data are samples from a distribution factorized as

  p_XK(x, k) = p_K(k) p_{X|K}(x | k),  i.e.  p_X(x) = ∑_{k ∈ K} p_K(k) p_{X|K}(x | k),

and that the form of the distribution is known (except for the distribution parameters).

Recognition (part of the E-step): Let's define the result of recognition not as a single decision for some state k (as done in K-means), but rather as the set of posterior probabilities (sometimes called responsibilities), for all k given x_i,

  γ_k(x_i) = p_{K|X}(k | x_i, θ^(t)) = p_{X|K}(x_i | k) p_K(k) / ∑_{k' ∈ K} p_{X|K}(x_i | k') p_K(k'),

that an object was in state k when observation x_i was made. The functions γ_k(x) can be viewed as discriminant functions.

General mixture distributions (cont.)

Learning (the rest of the E-step and the M-step): Given the training multiset T = ((x_i, k_i))_{i=1}^n (or the respective γ_k(x_i) instead of k_i), assume γ_k(x) is known, p_K(k) are not known, and p_{X|K}(x | k) are known except for the parameter values Θ_k, i.e. we shall write p_{X|K}(x | k, Θ_k). Let the object model m be the set of all unknown parameters, m = (p_K(k), Θ_k)_{k ∈ K}.

The log-likelihood of model m if we assume the k_i are known:

  log L(m) = log ∏_{i=1}^n p_XK(x_i, k_i) = ∑_{i=1}^n log p_K(k_i) + ∑_{i=1}^n log p_{X|K}(x_i | k_i, Θ_{k_i}).

The log-likelihood of model m if we assume a distribution (γ) over k is known:

  log L(m) = ∑_{i=1}^n ∑_{k ∈ K} γ_k(x_i) log p_K(k) + ∑_{i=1}^n ∑_{k ∈ K} γ_k(x_i) log p_{X|K}(x_i | k, Θ_k).

We search for the optimal model using maximum likelihood, m* = (p_K*(k), Θ_k*) = arg max_m log L(m), i.e. we compute

  p_K*(k) = (1/n) ∑_{i=1}^n γ_k(x_i)

and solve K independent tasks

  Θ_k* = arg max_{Θ_k} ∑_{i=1}^n γ_k(x_i) log p_{X|K}(x_i | k, Θ_k).
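The form of the Θ_k* task depends on the chosen component family. As a hypothetical example (not in the slides), for a mixture of two Poisson components the weighted ML task has a closed form: λ_k is the responsibility-weighted mean of the data. The counts and initial parameters below are made up.

```python
import numpy as np
from math import exp, factorial

def poisson_pmf(x, lam):
    return lam**x * exp(-lam) / factorial(x)

x = np.array([0, 1, 1, 2, 7, 8, 9, 10])   # made-up counts from two regimes
pk = np.array([0.5, 0.5])                  # p_K(k)
lam = np.array([1.0, 5.0])                 # Theta_k of the Poisson components

for _ in range(100):
    # Recognition: responsibilities gamma_k(x_i)
    lik = np.array([[pk[k] * poisson_pmf(xi, lam[k]) for k in range(2)]
                    for xi in x])
    gamma = lik / lik.sum(axis=1, keepdims=True)
    # Learning: p_K(k) is the mean responsibility; each independent task
    # arg max_lam sum_i gamma_k(x_i) log Poisson(x_i | lam) is solved by
    # the responsibility-weighted mean of the x_i.
    pk = gamma.mean(axis=0)
    lam = (gamma * x[:, None]).sum(axis=0) / gamma.sum(axis=0)
```

Swapping the component family (Gaussian, Bernoulli, ...) changes only the pmf and the closed-form Θ_k update; the responsibility computation and the p_K update stay the same.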
EM for mixture distributions

Unsupervised learning algorithm [?] for general mixture distributions:
1. Initialize the model parameters m = (p_K(k), Θ_k)_{k ∈ K}.
2. Perform the recognition task, i.e. assuming m is known, compute
     γ_k(x_i) = p̂_{K|X}(k | x_i) = p_K(k) p_{X|K}(x_i | k, Θ_k) / ∑_{j ∈ K} p_K(j) p_{X|K}(x_i | j, Θ_j).
3. Perform the learning task, i.e. assuming the γ_k(x_i) are known, update the ML estimates of the model parameters p_K(k) and Θ_k for all k:
     p_K(k) = (1/n) ∑_{i=1}^n γ_k(x_i),
     Θ_k = arg max_{Θ_k} ∑_{i=1}^n γ_k(x_i) log p_{X|K}(x_i | k, Θ_k).
4. Iterate steps 2 and 3 until the model stabilizes.

Features:
- The algorithm does not specify how to update Θ_k in step 3; that depends on the chosen form of p_{X|K}.
- The model created in iteration t is always at least as good as the model from iteration t-1, i.e. L(m) = p(T | m) increases.

Special case: Gaussian Mixture Model

Each (k-th) component is a Gaussian distribution:

  N(x | μ_k, Σ_k) = (2π)^{-D/2} |Σ_k|^{-1/2} exp{-(1/2)(x - μ_k)^T Σ_k^{-1} (x - μ_k)}.

Gaussian Mixture Model (GMM):

  p(x) = ∑_{k=1}^K p_K(k) p_{X|K}(x | k, Θ_k) = ∑_{k=1}^K α_k N(x | μ_k, Σ_k),

assuming ∑_{k=1}^K α_k = 1 and 0 ≤ α_k ≤ 1.
EM for GMM

1. Initialize the model parameters m = (p_K(k), μ_k, Σ_k)_{k ∈ K}.
2. Perform the recognition task as in the general case, i.e. assuming m is known, compute
     γ_k(x_i) = p̂_{K|X}(k | x_i) = α_k N(x_i | μ_k, Σ_k) / ∑_{j ∈ K} α_j N(x_i | μ_j, Σ_j).
3. Perform the learning task, i.e. assuming the γ_k(x_i) are known, update the ML estimates of the model parameters α_k, μ_k, and Σ_k for all k:
     α_k = p_K(k) = (1/n) ∑_{i=1}^n γ_k(x_i),
     μ_k = ∑_{i=1}^n γ_k(x_i) x_i / ∑_{i=1}^n γ_k(x_i),
     Σ_k = ∑_{i=1}^n γ_k(x_i)(x_i - μ_k)(x_i - μ_k)^T / ∑_{i=1}^n γ_k(x_i).
4. Iterate steps 2 and 3 until the model stabilizes.

Remarks:
- Each data point belongs to all components to a certain degree γ_k(x_i).
- The equation for μ_k is just a weighted average of the x_i.
- The equation for Σ_k is just a weighted covariance matrix.

Example: Source data. Source data generated from 3 Gaussians (figure).
Example: Input to the EM algorithm. The data were given to the EM algorithm as an unlabeled dataset (figure).

Example: EM iterations (sequence of figures showing the fitted model over successive EM iterations).
Example: Ground truth and EM estimate. The ground truth (left) and the EM estimate (right) are very close, because we had enough data, we knew the right number of components, and we were lucky that EM converged to the right local optimum of the likelihood function.
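A run like this example can be sketched in NumPy; an illustrative, unoptimized implementation of the E-step and M-step updates for a GMM (the synthetic data and seeds are arbitrary, and small ridge terms are added only to keep the covariances invertible):

```python
import numpy as np

def gauss_pdf(X, mu, Sigma):
    D = X.shape[1]
    diff = X - mu
    inv = np.linalg.inv(Sigma)
    norm = (2 * np.pi)**(-D / 2) * np.linalg.det(Sigma)**(-0.5)
    return norm * np.exp(-0.5 * np.einsum('nd,de,ne->n', diff, inv, diff))

def em_gmm(X, K, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    n, D = X.shape
    alpha = np.full(K, 1.0 / K)                      # mixing weights alpha_k
    mu = X[rng.choice(n, size=K, replace=False)].astype(float)
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    for _ in range(iters):
        # E-step: responsibilities gamma_k(x_i)
        dens = np.stack([alpha[k] * gauss_pdf(X, mu[k], Sigma[k])
                         for k in range(K)], axis=1)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted ML updates for alpha_k, mu_k, Sigma_k
        Nk = gamma.sum(axis=0)
        alpha = Nk / n
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
    return alpha, mu, Sigma

rng = np.random.default_rng(2)
X = np.vstack([rng.normal((0, 0), 0.5, (200, 2)),
               rng.normal((5, 5), 0.5, (200, 2))])
alpha, mu, Sigma = em_gmm(X, K=2)
```

As in the slides, the run can converge to different local optima depending on the initialization; production code would also work with log densities to avoid underflow.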
Baum-Welch Algorithm: EM for HMM

Hidden Markov Model

A 1st-order HMM is a generative probabilistic model formed by:
- a sequence of hidden variables X_0, ..., X_t, the domain of each being the set of states {s_1, ..., s_N};
- a sequence of observed variables E_1, ..., E_t, the domain of each being the set of observations {v_1, ..., v_M};
- an initial distribution over hidden states P(X_0), a transition model P(X_t | X_{t-1}), and an emission model P(E_t | X_t).

Simulating an HMM:
1. Generate an initial state x_0 according to P(X_0). Set t = 1.
2. Generate a new current state x_t according to P(X_t | x_{t-1}).
3. Generate an observation e_t according to P(E_t | x_t).
4. Advance time: t = t + 1.
5. Finish, or repeat from step 2.

With an HMM, efficient algorithms exist for solving inference tasks; but we have no idea (so far) how to learn the HMM parameters from the observation sequence, because we do not have access to the hidden states.

Learning an HMM from data

Is it possible to learn an HMM from data?
- There is no known way to analytically solve for the model which maximizes the probability of the observations.
- There is no optimal way of estimating the model parameters from the observation sequences.
- We can, however, find model parameters such that the probability of the observations is locally maximized: the Baum-Welch algorithm (a special case of EM).

Let's use a slightly different notation to emphasize the model parameters:
- π = [π_i] = [P(X_1 = s_i)] ... the vector of the initial probabilities of the states;
- A = [a_{ij}] = [P(X_t = s_j | X_{t-1} = s_i)] ... the matrix of transition probabilities to the next state given the current state;
- B = [b_{ik}] = [P(E_t = v_k | X_t = s_i)] ... the matrix of observation probabilities given the current state.

The whole set of HMM parameters is then θ = (π, A, B).

The algorithm (presented on the next slides) computes the expected numbers of being in a state or taking a transition, given the observations and the current model parameters θ = (π, A, B), and then computes a new estimate of the model parameters, θ' = (π', A', B'), such that P(e_{1:T} | θ') ≥ P(e_{1:T} | θ).
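The simulation procedure can be sketched directly; the two-state model below is a made-up example with states and observations both coded as {0, 1}:

```python
import numpy as np

rng = np.random.default_rng(0)

pi = np.array([0.5, 0.5])                # P(X_0): initial state distribution
A = np.array([[0.7, 0.3],
              [0.3, 0.7]])               # A[i, j] = P(X_t = j | X_{t-1} = i)
B = np.array([[0.9, 0.1],
              [0.2, 0.8]])               # B[i, k] = P(E_t = k | X_t = i)

def simulate(T):
    x = rng.choice(2, p=pi)              # 1. initial state from P(X_0)
    states, obs = [], []
    for _ in range(T):
        x = rng.choice(2, p=A[x])        # 2. next state from P(X_t | x_{t-1})
        obs.append(int(rng.choice(2, p=B[x])))  # 3. observation from P(E_t | x_t)
        states.append(int(x))            # 4./5. advance time and repeat
    return states, obs

states, obs = simulate(100)
```

Learning reverses this: given only `obs`, Baum-Welch estimates π, A, and B without ever seeing `states`.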
Sufficient statistics

Let's define the probability of a transition from state s_i at time t to state s_j at time t+1, given the model and the observation sequence e_{1:T}:

  ξ_t(i, j) = P(X_t = s_i, X_{t+1} = s_j | e_{1:T}, θ)
            = α_t(s_i) a_{ij} b_{j,e_{t+1}} β_{t+1}(s_j) / P(e_{1:T} | θ)
            = α_t(s_i) a_{ij} b_{j,e_{t+1}} β_{t+1}(s_j) / ∑_{i=1}^N ∑_{j=1}^N α_t(s_i) a_{ij} b_{j,e_{t+1}} β_{t+1}(s_j),

where α_t and β_t are the forward and backward messages computed by the forward-backward algorithm, and b_{j,e_{t+1}} is the probability of the actually observed symbol e_{t+1} in state s_j. Further define the probability of being in state s_i at time t, given the model and the observation sequence:

  γ_t(i) = ∑_{j=1}^N ξ_t(i, j).

Then we can interpret

  ∑_{t=1}^{T-1} γ_t(i) as the expected number of transitions from state s_i, and
  ∑_{t=1}^{T-1} ξ_t(i, j) as the expected number of transitions from s_i to s_j.

Baum-Welch algorithm

The re-estimation formulas are:

  π'_i = expected frequency of being in state s_i at time t = 1 = γ_1(i);

  a'_{ij} = (expected number of transitions from s_i to s_j) / (expected number of transitions from s_i)
          = ∑_{t=1}^{T-1} ξ_t(i, j) / ∑_{t=1}^{T-1} γ_t(i);

  b'_{jk} = (expected number of times being in state s_j and observing v_k) / (expected number of times being in state s_j)
          = ∑_{t=1}^T I(e_t = v_k) γ_t(j) / ∑_{t=1}^T γ_t(j).

As with other EM variants, with the old model parameters θ = (π, A, B) and the new, re-estimated parameters θ' = (π', A', B'), the new model is at least as likely as the old one: P(e_{1:T} | θ') ≥ P(e_{1:T} | θ). The above equations are used iteratively, with θ' taking the place of θ.
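Putting the messages and re-estimation formulas together, a compact sketch of one Baum-Welch step (unscaled forward-backward, so it is only suitable for short sequences; real implementations rescale the messages or work in log space; the toy parameters and observation sequence are made up):

```python
import numpy as np

def forward_backward(obs, pi, A, B):
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                      # forward messages
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[-1] = 1.0                                    # backward messages
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return alpha, beta

def baum_welch_step(obs, pi, A, B):
    T, N = len(obs), len(pi)
    alpha, beta = forward_backward(obs, pi, A, B)
    p_obs = alpha[-1].sum()                           # P(e_{1:T} | theta)
    # xi_t(i, j) = alpha_t(i) a_ij b_{j, e_{t+1}} beta_{t+1}(j), normalized
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :]
        xi[t] /= xi[t].sum()
    gamma = alpha * beta / p_obs                      # gamma_t(i) for all t
    pi_new = gamma[0]                                 # re-estimation formulas
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    B_new = np.zeros_like(B)
    for k in range(B.shape[1]):
        mask = (np.asarray(obs) == k)
        B_new[:, k] = gamma[mask].sum(axis=0) / gamma.sum(axis=0)
    return pi_new, A_new, B_new, p_obs

obs = [0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.8, 0.2], [0.3, 0.7]])
likelihoods = []
for _ in range(20):
    pi, A, B, p = baum_welch_step(obs, pi, A, B)
    likelihoods.append(p)
```

The recorded `likelihoods` illustrate the EM guarantee stated above: each re-estimation step leaves P(e_{1:T} | θ) unchanged or larger.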
Summary

Competencies: after this lecture, a student shall be able to:
- define and explain the task of maximum likelihood estimation;
- explain why we can maximize the log-likelihood instead of the likelihood, and describe the advantages;
- describe the issues we face when trying to maximize the likelihood in the case of incomplete data;
- explain the general high-level principle of the Expectation-Maximization algorithm;
- describe the pros and cons of the EM algorithm, especially what happens with the likelihood in one EM iteration;
- describe the EM algorithm for mixture distributions, including the notion of responsibilities;
- explain the Baum-Welch algorithm, i.e. the application of EM to HMMs: what parameters are learned and how (conceptually).
Review of Last Wee Three classificatio models Discrimiat Model: lear the decisio boudary directly ad aly it to determie the class of each data oit Discrimiative Model: lear PY directly Geerative Model:
More informationIntroduction to Machine Learning DIS10
CS 189 Fall 017 Itroductio to Machie Learig DIS10 1 Fu with Lagrage Multipliers (a) Miimize the fuctio such that f (x,y) = x + y x + y = 3. Solutio: The Lagragia is: L(x,y,λ) = x + y + λ(x + y 3) Takig
More informationHypothesis Testing. Evaluation of Performance of Learned h. Issues. Trade-off Between Bias and Variance
Hypothesis Testig Empirically evaluatig accuracy of hypotheses: importat activity i ML. Three questios: Give observed accuracy over a sample set, how well does this estimate apply over additioal samples?
More informationChapter 6 Principles of Data Reduction
Chapter 6 for BST 695: Special Topics i Statistical Theory. Kui Zhag, 0 Chapter 6 Priciples of Data Reductio Sectio 6. Itroductio Goal: To summarize or reduce the data X, X,, X to get iformatio about a
More informationStatistical Inference (Chapter 10) Statistical inference = learn about a population based on the information provided by a sample.
Statistical Iferece (Chapter 10) Statistical iferece = lear about a populatio based o the iformatio provided by a sample. Populatio: The set of all values of a radom variable X of iterest. Characterized
More informationMaximum Likelihood Estimation and Complexity Regularization
ECE90 Sprig 004 Statistical Regularizatio ad Learig Theory Lecture: 4 Maximum Likelihood Estimatio ad Complexity Regularizatio Lecturer: Rob Nowak Scribe: Pam Limpiti Review : Maximum Likelihood Estimatio
More informationREGRESSION WITH QUADRATIC LOSS
REGRESSION WITH QUADRATIC LOSS MAXIM RAGINSKY Regressio with quadratic loss is aother basic problem studied i statistical learig theory. We have a radom couple Z = X, Y ), where, as before, X is a R d
More informationLecture 19: Convergence
Lecture 19: Covergece Asymptotic approach I statistical aalysis or iferece, a key to the success of fidig a good procedure is beig able to fid some momets ad/or distributios of various statistics. I may
More informationSequences, Mathematical Induction, and Recursion. CSE 2353 Discrete Computational Structures Spring 2018
CSE 353 Discrete Computatioal Structures Sprig 08 Sequeces, Mathematical Iductio, ad Recursio (Chapter 5, Epp) Note: some course slides adopted from publisher-provided material Overview May mathematical
More informationLecture 12: November 13, 2018
Mathematical Toolkit Autum 2018 Lecturer: Madhur Tulsiai Lecture 12: November 13, 2018 1 Radomized polyomial idetity testig We will use our kowledge of coditioal probability to prove the followig lemma,
More informationGeneralized Semi- Markov Processes (GSMP)
Geeralized Semi- Markov Processes (GSMP) Summary Some Defiitios Markov ad Semi-Markov Processes The Poisso Process Properties of the Poisso Process Iterarrival times Memoryless property ad the residual
More informationProblem Set 4 Due Oct, 12
EE226: Radom Processes i Systems Lecturer: Jea C. Walrad Problem Set 4 Due Oct, 12 Fall 06 GSI: Assae Gueye This problem set essetially reviews detectio theory ad hypothesis testig ad some basic otios
More informationStat410 Probability and Statistics II (F16)
Some Basic Cocepts of Statistical Iferece (Sec 5.) Suppose we have a rv X that has a pdf/pmf deoted by f(x; θ) or p(x; θ), where θ is called the parameter. I previous lectures, we focus o probability problems
More informationChapter 8: Estimating with Confidence
Chapter 8: Estimatig with Cofidece Sectio 8.2 The Practice of Statistics, 4 th editio For AP* STARNES, YATES, MOORE Chapter 8 Estimatig with Cofidece 8.1 Cofidece Itervals: The Basics 8.2 8.3 Estimatig
More informationModule 1 Fundamentals in statistics
Normal Distributio Repeated observatios that differ because of experimetal error ofte vary about some cetral value i a roughly symmetrical distributio i which small deviatios occur much more frequetly
More informationw (1) ˆx w (1) x (1) /ρ and w (2) ˆx w (2) x (2) /ρ.
2 5. Weighted umber of late jobs 5.1. Release dates ad due dates: maximimizig the weight of o-time jobs Oce we add release dates, miimizig the umber of late jobs becomes a sigificatly harder problem. For
More informationApproximations and more PMFs and PDFs
Approximatios ad more PMFs ad PDFs Saad Meimeh 1 Approximatio of biomial with Poisso Cosider the biomial distributio ( b(k,,p = p k (1 p k, k λ: k Assume that is large, ad p is small, but p λ at the limit.
More informationSimulation. Two Rule For Inverting A Distribution Function
Simulatio Two Rule For Ivertig A Distributio Fuctio Rule 1. If F(x) = u is costat o a iterval [x 1, x 2 ), the the uiform value u is mapped oto x 2 through the iversio process. Rule 2. If there is a jump
More informationOptimally Sparse SVMs
A. Proof of Lemma 3. We here prove a lower boud o the umber of support vectors to achieve geeralizatio bouds of the form which we cosider. Importatly, this result holds ot oly for liear classifiers, but
More informationNUMERICAL METHODS FOR SOLVING EQUATIONS
Mathematics Revisio Guides Numerical Methods for Solvig Equatios Page 1 of 11 M.K. HOME TUITION Mathematics Revisio Guides Level: GCSE Higher Tier NUMERICAL METHODS FOR SOLVING EQUATIONS Versio:. Date:
More informationDiscrete Mathematics for CS Spring 2007 Luca Trevisan Lecture 22
CS 70 Discrete Mathematics for CS Sprig 2007 Luca Trevisa Lecture 22 Aother Importat Distributio The Geometric Distributio Questio: A biased coi with Heads probability p is tossed repeatedly util the first
More informationAgnostic Learning and Concentration Inequalities
ECE901 Sprig 2004 Statistical Regularizatio ad Learig Theory Lecture: 7 Agostic Learig ad Cocetratio Iequalities Lecturer: Rob Nowak Scribe: Aravid Kailas 1 Itroductio 1.1 Motivatio I the last lecture
More informationCS537. Numerical Analysis and Computing
CS57 Numerical Aalysis ad Computig Lecture Locatig Roots o Equatios Proessor Ju Zhag Departmet o Computer Sciece Uiversity o Ketucky Leigto KY 456-6 Jauary 9 9 What is the Root May physical system ca be
More informationEstimation of the Mean and the ACVF
Chapter 5 Estimatio of the Mea ad the ACVF A statioary process {X t } is characterized by its mea ad its autocovariace fuctio γ ), ad so by the autocorrelatio fuctio ρ ) I this chapter we preset the estimators
More informationElement sampling: Part 2
Chapter 4 Elemet samplig: Part 2 4.1 Itroductio We ow cosider uequal probability samplig desigs which is very popular i practice. I the uequal probability samplig, we ca improve the efficiecy of the resultig
More informationBig Picture. 5. Data, Estimates, and Models: quantifying the accuracy of estimates.
5. Data, Estimates, ad Models: quatifyig the accuracy of estimates. 5. Estimatig a Normal Mea 5.2 The Distributio of the Normal Sample Mea 5.3 Normal data, cofidece iterval for, kow 5.4 Normal data, cofidece
More informationTable 12.1: Contingency table. Feature b. 1 N 11 N 12 N 1b 2 N 21 N 22 N 2b. ... a N a1 N a2 N ab
Sectio 12 Tests of idepedece ad homogeeity I this lecture we will cosider a situatio whe our observatios are classified by two differet features ad we would like to test if these features are idepedet
More informationElementary manipulations of probabilities
Elemetary maipulatios of probabilities Set probability of multi-valued r.v. {=Odd} = +3+5 = /6+/6+/6 = ½ X X,, X i j X i j Multi-variat distributio: Joit probability: X true true X X,, X X i j i j X X
More informationRandomized Algorithms I, Spring 2018, Department of Computer Science, University of Helsinki Homework 1: Solutions (Discussed January 25, 2018)
Radomized Algorithms I, Sprig 08, Departmet of Computer Sciece, Uiversity of Helsiki Homework : Solutios Discussed Jauary 5, 08). Exercise.: Cosider the followig balls-ad-bi game. We start with oe black
More information11 Hidden Markov Models
Hidde Markov Models Hidde Markov Models are a popular machie learig approach i bioiformatics. Machie learig algorithms are preseted with traiig data, which are used to derive importat isights about the
More informationDiscrete Mathematics and Probability Theory Spring 2013 Anant Sahai Lecture 18
EECS 70 Discrete Mathematics ad Probability Theory Sprig 2013 Aat Sahai Lecture 18 Iferece Oe of the major uses of probability is to provide a systematic framework to perform iferece uder ucertaity. A
More informationOptimization Methods MIT 2.098/6.255/ Final exam
Optimizatio Methods MIT 2.098/6.255/15.093 Fial exam Date Give: December 19th, 2006 P1. [30 pts] Classify the followig statemets as true or false. All aswers must be well-justified, either through a short
More informationLecture 2 October 11
Itroductio to probabilistic graphical models 203/204 Lecture 2 October Lecturer: Guillaume Oboziski Scribes: Aymeric Reshef, Claire Verade Course webpage: http://www.di.es.fr/~fbach/courses/fall203/ 2.
More informationPixel Recurrent Neural Networks
Pixel Recurret Neural Networks Aa ro va de Oord, Nal Kalchbreer, Koray Kavukcuoglu Google DeepMid August 2016 Preseter - Neha M Example problem (completig a image) Give the first half of the image, create
More informationPattern Classification
Patter Classificatio All materials i these slides were tae from Patter Classificatio (d ed) by R. O. Duda, P. E. Hart ad D. G. Stor, Joh Wiley & Sos, 000 with the permissio of the authors ad the publisher
More information