Unsupervised Learning 2001
1 Unsupervised Learning 2001
Lecture 3: The EM Algorithm
Zoubin Ghahramani and Carl Edward Rasmussen
Gatsby Computational Neuroscience Unit
MSc Intelligent Systems, Computer Science
2 The Expectation Maximization (EM) algorithm

Given a set of observed (visible) variables V, a set of unobserved (hidden) variables H, and model parameters \theta, optimize the log likelihood:

    L(\theta) = \log p(V|\theta) = \log \int p(H, V|\theta) \, dH,    (1)

where we have written the marginal for the visibles in terms of an integral over the joint distribution for hidden and visible variables.

Using Jensen's inequality, for any distribution of hidden states q(H) we have:

    L = \log \int q(H) \frac{p(H, V|\theta)}{q(H)} \, dH \ge \int q(H) \log \frac{p(H, V|\theta)}{q(H)} \, dH = F(q, \theta),    (2)

defining the F(q, \theta) functional, which is a lower bound on the log likelihood. In the EM algorithm, we alternately optimize F(q, \theta) with respect to q and \theta, and we can prove that this will never decrease L.
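The bound in (2) can be checked numerically for a toy model with one observed value and a single discrete hidden variable (a minimal sketch; the two-component joint and its numbers are invented for illustration):

```python
import numpy as np

# Toy model: one observed value v, hidden k in {0, 1}.
# The joint p(k, v | theta) below is invented purely for illustration.
p_joint = np.array([0.12, 0.28])       # p(k, v | theta) for k = 0, 1
L = np.log(p_joint.sum())              # log p(v | theta) = log sum_k p(k, v | theta)

def F(q):
    """Lower bound F(q, theta) = sum_k q_k log[p(k, v | theta) / q_k]."""
    return np.sum(q * np.log(p_joint / q))

q_arbitrary = np.array([0.5, 0.5])
q_posterior = p_joint / p_joint.sum()  # p(k | v, theta)

assert F(q_arbitrary) <= L             # F lower-bounds the log likelihood for any q
assert np.isclose(F(q_posterior), L)   # the bound is tight at the posterior
```

The bound is tight exactly when q equals the posterior over the hidden variable, which is what the E step computes.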
3 The E and M steps of EM

The lower bound on the log likelihood:

    F(q, \theta) = \int q(H) \log \frac{p(H, V|\theta)}{q(H)} \, dH = \int q(H) \log p(H, V|\theta) \, dH + H(q),    (3)

where H(q) = -\int q(H) \log q(H) \, dH is the entropy of q. We iteratively alternate:

E step: optimize F(q, \theta) with respect to the distribution over hidden variables given the parameters:

    q^{(k)}(H) := \operatorname{argmax}_{q(H)} F(q(H), \theta^{(k-1)}).    (4)

M step: maximize F(q, \theta) with respect to the parameters given the hidden distribution:

    \theta^{(k)} := \operatorname{argmax}_{\theta} F(q^{(k)}(H), \theta) = \operatorname{argmax}_{\theta} \int q^{(k)}(H) \log p(H, V|\theta) \, dH,    (5)

which is equivalent to optimizing the expected complete-data log likelihood \log p(H, V|\theta), since the entropy of q(H) does not depend on \theta.
4 EM as Coordinate Ascent in F
5 The EM algorithm never decreases the log likelihood

The difference between the cost functions:

    L(\theta) - F(q, \theta) = \log p(V|\theta) - \int q(H) \log \frac{p(H, V|\theta)}{q(H)} \, dH
                             = \log p(V|\theta) - \int q(H) \log \frac{p(H|V, \theta) \, p(V|\theta)}{q(H)} \, dH
                             = - \int q(H) \log \frac{p(H|V, \theta)}{q(H)} \, dH = KL(q(H), p(H|V, \theta)),    (6)

is called the Kullback-Leibler divergence; it is non-negative, and zero if and only if q(H) = p(H|V, \theta) (this is exactly what the E step achieves). Although we are working with the "wrong" cost function, the likelihood is still increased at every iteration:

    L(\theta^{(k-1)}) = F(q^{(k)}, \theta^{(k-1)}) \le F(q^{(k)}, \theta^{(k)}) \le L(\theta^{(k)}),    (7)

where the first equality holds because of the E step, the first inequality comes from the M step, and the final inequality from Jensen. Usually EM converges to a local optimum of L (although there are exceptions).
6 The KL divergence KL(q(x), p(x)) is non-negative and zero iff \forall x: p(x) = q(x)

First let us consider discrete distributions; the Kullback-Leibler divergence is:

    KL(q, p) = \sum_i q_i \log \frac{q_i}{p_i}.    (8)

To find the distribution q which minimizes KL(q, p), we add a Lagrange multiplier to enforce normalization:

    E = KL(q, p) + \lambda \Big(1 - \sum_i q_i\Big) = \sum_i q_i \log \frac{q_i}{p_i} + \lambda \Big(1 - \sum_i q_i\Big).    (9)

We then take partial derivatives and set them to zero:

    \frac{\partial E}{\partial q_i} = \log q_i - \log p_i + 1 - \lambda = 0  \Rightarrow  q_i = p_i \exp(\lambda - 1),
    \frac{\partial E}{\partial \lambda} = 1 - \sum_i q_i = 0  \Rightarrow  \sum_i q_i = 1  \Rightarrow  q_i = p_i.    (10)
7 Why the KL divergence is non-negative (cont.)

Check that the curvature (Hessian) is positive (definite), corresponding to a minimum:

    \frac{\partial^2 E}{\partial q_i^2} = \frac{1}{q_i} > 0,  \qquad  \frac{\partial^2 E}{\partial q_i \, \partial q_j} = 0 \ (i \ne j),    (11)

showing that q_i = p_i is a genuine minimum. At the minimum it is easily verified that KL(p, p) = 0. A similar proof can be done for continuous distributions, with the partial derivatives replaced by functional derivatives.
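These stationarity and curvature conditions can be spot-checked numerically (a sketch; the three-point distributions are arbitrary):

```python
import numpy as np

def kl(q, p):
    """Discrete KL divergence sum_i q_i log(q_i / p_i), as in (8)."""
    return np.sum(q * np.log(q / p))

# Two arbitrary three-point distributions, invented for illustration.
p = np.array([0.2, 0.3, 0.5])
q = np.array([0.3, 0.3, 0.4])

assert kl(q, p) > 0                    # strictly positive when q differs from p
assert np.isclose(kl(p, p), 0.0)       # zero at the minimum q = p

# A small normalized perturbation away from q = p increases the KL,
# consistent with the positive-definite Hessian above.
d = 1e-4 * np.array([1.0, -2.0, 1.0])  # sums to zero, so p + d stays normalized
assert kl(p + d, p) > 0
```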
8 The Gaussian mixture model (E-step)

In the Gaussian mixture density model, the densities are given by:

    p(x|\theta) \propto \sum_{k=1}^K \frac{\pi_k}{\sigma_k} \exp\Big(-\frac{1}{2\sigma_k^2}(x - \mu_k)^2\Big),    (12)

where \theta is the collection of parameters: means \mu_k, variances \sigma_k^2 and mixing proportions \pi_k (which must be positive and sum to one).

There are (binary) hidden variables H_k^{(c)}, indicating which component observation x^{(c)} belongs to. The conditional likelihood and priors are:

    p(x|H, \theta) = \prod_{k=1}^K \Big[\sigma_k^{-1} \exp\Big(-\frac{1}{2\sigma_k^2}(x - \mu_k)^2\Big)\Big]^{H_k},  \qquad  p(H_k = 1|\theta) = \pi_k.    (13)

In the E-step, compute the posterior for the hidden states given the current parameters, q(H) = p(H|x, \theta) \propto p(x|H, \theta) \, p(H|\theta):

    q(H_k^{(c)}) \propto \frac{\pi_k}{\sigma_k} \exp\Big(-\frac{1}{2\sigma_k^2}(x^{(c)} - \mu_k)^2\Big),    (14)

with the normalization q(H_k^{(c)}) \leftarrow q(H_k^{(c)}) / \sum_{k'} q(H_{k'}^{(c)}).
9 The Gaussian mixture model (M-step)

In the M-step we optimize the sum (since H is discrete):

    E = \sum_H q(H) \log[p(x|H, \theta) \, p(H|\theta)] = \sum_{c,k} q(H_k^{(c)}) \Big[\log \pi_k - \log \sigma_k - \frac{1}{2\sigma_k^2}(x^{(c)} - \mu_k)^2\Big].    (15)

Optimization with respect to the parameters is done by setting the partial derivatives of E to zero:

    \frac{\partial E}{\partial \mu_k} = \sum_c q(H_k^{(c)}) \frac{x^{(c)} - \mu_k}{\sigma_k^2} = 0  \Rightarrow  \mu_k = \frac{\sum_c q(H_k^{(c)}) \, x^{(c)}}{\sum_c q(H_k^{(c)})},

    \frac{\partial E}{\partial \sigma_k} = \sum_c q(H_k^{(c)}) \Big[-\frac{1}{\sigma_k} + \frac{(x^{(c)} - \mu_k)^2}{\sigma_k^3}\Big] = 0  \Rightarrow  \sigma_k^2 = \frac{\sum_c q(H_k^{(c)}) \, (x^{(c)} - \mu_k)^2}{\sum_c q(H_k^{(c)})},

    \frac{\partial E}{\partial \pi_k} + \lambda = 0  \Rightarrow  \pi_k = \frac{1}{N} \sum_c q(H_k^{(c)}),    (16)

where \lambda is a Lagrange multiplier ensuring that the mixing proportions sum to unity, and N is the number of observations.
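Equations (14) and (16) translate directly into a few lines of code. Below is a minimal sketch for one-dimensional data (the synthetic two-component data and the initialization are invented for illustration); the final assertion checks the monotonicity guaranteed by the argument on slide 5:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data from two Gaussians (invented for this example).
x = np.concatenate([rng.normal(-2, 0.5, 100), rng.normal(2, 1.0, 100)])

K = 2
pi = np.ones(K) / K                  # mixing proportions pi_k
mu = np.array([-1.0, 1.0])           # means mu_k
sigma = np.ones(K)                   # standard deviations sigma_k

def log_likelihood():
    comp = pi / (np.sqrt(2 * np.pi) * sigma) * np.exp(
        -(x[:, None] - mu) ** 2 / (2 * sigma ** 2))
    return np.sum(np.log(comp.sum(axis=1)))

lls = []
for _ in range(20):
    # E step: responsibilities q(H_k^(c)), eq. (14)
    q = pi / sigma * np.exp(-(x[:, None] - mu) ** 2 / (2 * sigma ** 2))
    q /= q.sum(axis=1, keepdims=True)
    # M step: closed-form updates, eq. (16)
    Nk = q.sum(axis=0)
    mu = (q * x[:, None]).sum(axis=0) / Nk
    sigma = np.sqrt((q * (x[:, None] - mu) ** 2).sum(axis=0) / Nk)
    pi = Nk / len(x)
    lls.append(log_likelihood())

# The log likelihood never decreases across iterations.
assert all(b >= a - 1e-9 for a, b in zip(lls, lls[1:]))
```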
10 Factor Analysis

[Graphical model: latent factors X_1, ..., X_K connected by the loading matrix \Lambda to observations Y_1, Y_2, ..., Y_D.]

Linear generative model:

    y_d = \sum_{k=1}^K \Lambda_{dk} x_k + \epsilon_d

- x_k are independent N(0, 1) Gaussian factors
- \epsilon_d are independent N(0, \Psi_{dd}) Gaussian noise
- K < D

So, y is Gaussian with:

    P(y) = \int P(x) P(y|x) \, dx = N(0, \Lambda \Lambda^\top + \Psi),

where \Lambda is a D \times K matrix, and \Psi is diagonal.

Dimensionality reduction: finds a low-dimensional projection of high-dimensional data that captures the correlation structure of the data.
11 EM for Factor Analysis

[Graphical model as before: factors X_1, ..., X_K, loadings \Lambda, observations Y_1, Y_2, ..., Y_D.]

The model for y:

    P(y|\theta) = \int P(x|\theta) P(y|x, \theta) \, dx = N(0, \Lambda \Lambda^\top + \Psi)

Model parameters: \theta = \{\Lambda, \Psi\}.

E step: For each data point y_n, compute the posterior distribution of hidden factors given the observed data: Q_n(x) = P(x|y_n, \theta_t).

M step: Find the \theta_{t+1} that maximises F(Q, \theta):

    F(Q, \theta) = \sum_n \int Q_n(x) \, [\log P(x|\theta) + \log P(y_n|x, \theta) - \log Q_n(x)] \, dx
                 = \sum_n \int Q_n(x) \, [\log P(x|\theta) + \log P(y_n|x, \theta)] \, dx + \text{const}.
12 The E step for Factor Analysis

E step: For each data point y_n, compute the posterior distribution of hidden factors given the observed data:

    Q_n(x) = P(x|y_n, \theta_t) = P(x, y_n|\theta) / P(y_n|\theta)

Tactic: write P(x, y_n|\theta), consider y_n to be fixed. What is this as a function of x?

    P(x, y_n) = P(x) P(y_n|x)
              = (2\pi)^{-K/2} \exp\{-\tfrac{1}{2} x^\top x\} \; |2\pi\Psi|^{-1/2} \exp\{-\tfrac{1}{2} (y_n - \Lambda x)^\top \Psi^{-1} (y_n - \Lambda x)\}
              = \text{const} \times \exp\{-\tfrac{1}{2} [x^\top x + (y_n - \Lambda x)^\top \Psi^{-1} (y_n - \Lambda x)]\}
              = \text{const} \times \exp\{-\tfrac{1}{2} [x^\top (I + \Lambda^\top \Psi^{-1} \Lambda) x - 2 x^\top \Lambda^\top \Psi^{-1} y_n]\}
              = \text{const} \times \exp\{-\tfrac{1}{2} [x^\top \Sigma^{-1} x - 2 x^\top \Sigma^{-1} \mu_n + \mu_n^\top \Sigma^{-1} \mu_n]\}

So Q_n(x) is Gaussian with

    \Sigma = (I + \Lambda^\top \Psi^{-1} \Lambda)^{-1} = I - \beta \Lambda  \qquad \text{and} \qquad  \mu_n = \Sigma \Lambda^\top \Psi^{-1} y_n = \beta y_n,

where \beta \equiv \Sigma \Lambda^\top \Psi^{-1} = \Lambda^\top (\Lambda \Lambda^\top + \Psi)^{-1} (by the matrix inversion lemma). Note that \mu_n is a linear function of y_n and \Sigma does not depend on y_n.
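The two forms of the posterior moments (the K-by-K inverse derived above, and the \beta = \Lambda^\top(\Lambda\Lambda^\top + \Psi)^{-1} form obtained from the matrix inversion lemma of slide 16) can be confirmed to agree numerically (a sketch with arbitrary random parameters):

```python
import numpy as np

rng = np.random.default_rng(1)
D, K = 5, 2
Lam = rng.normal(size=(D, K))            # factor loadings Lambda (D x K), arbitrary
Psi = np.diag(rng.uniform(0.5, 2.0, D))  # diagonal noise covariance, arbitrary
y = rng.normal(size=D)                   # one observed data point

# K x K form derived on this slide:
Sigma = np.linalg.inv(np.eye(K) + Lam.T @ np.linalg.inv(Psi) @ Lam)
mu = Sigma @ Lam.T @ np.linalg.inv(Psi) @ y

# Equivalent D x D form via the matrix inversion lemma:
beta = Lam.T @ np.linalg.inv(Lam @ Lam.T + Psi)
assert np.allclose(Sigma, np.eye(K) - beta @ Lam)  # Sigma = I - beta Lambda
assert np.allclose(mu, beta @ y)                   # mu is linear in y
```

The D-by-D form is the one usually used when D < K; here it simply serves as a cross-check.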
13 The M step for Factor Analysis

M step: Find \theta_{t+1} maximising

    F = \sum_n \int Q_n(x) \, [\log P(x|\theta) + \log P(y_n|x, \theta)] \, dx + \text{const}.

    \log P(x|\theta) + \log P(y_n|x, \theta)
      = \text{const} - \tfrac{1}{2} x^\top x - \tfrac{1}{2} \log|\Psi| - \tfrac{1}{2} (y_n - \Lambda x)^\top \Psi^{-1} (y_n - \Lambda x)
      = \text{const} - \tfrac{1}{2} \log|\Psi| - \tfrac{1}{2} [y_n^\top \Psi^{-1} y_n - 2 y_n^\top \Psi^{-1} \Lambda x + x^\top \Lambda^\top \Psi^{-1} \Lambda x]
      = \text{const} - \tfrac{1}{2} \log|\Psi| - \tfrac{1}{2} [y_n^\top \Psi^{-1} y_n - 2 y_n^\top \Psi^{-1} \Lambda x + \operatorname{tr}(\Lambda^\top \Psi^{-1} \Lambda \, x x^\top)]

Taking expectations over Q_n(x):

      = \text{const} - \tfrac{1}{2} \log|\Psi| - \tfrac{1}{2} [y_n^\top \Psi^{-1} y_n - 2 y_n^\top \Psi^{-1} \Lambda \mu_n + \operatorname{tr}(\Lambda^\top \Psi^{-1} \Lambda (\mu_n \mu_n^\top + \Sigma))]
14 The M step for Factor Analysis (cont.)

    F = \text{const} - \frac{N}{2} \log|\Psi| - \frac{1}{2} \sum_n \Big[ y_n^\top \Psi^{-1} y_n - 2 y_n^\top \Psi^{-1} \Lambda \mu_n + \operatorname{tr}(\Lambda^\top \Psi^{-1} \Lambda (\mu_n \mu_n^\top + \Sigma)) \Big]

Taking derivatives with respect to \Lambda and \Psi^{-1}, using \frac{\partial \operatorname{tr}(AB)}{\partial B} = A^\top and \frac{\partial \log|A|}{\partial A} = (A^{-1})^\top:

    \frac{\partial F}{\partial \Lambda} = \Psi^{-1} \sum_n y_n \mu_n^\top - \Psi^{-1} \Lambda \Big( N\Sigma + \sum_n \mu_n \mu_n^\top \Big) = 0
      \Rightarrow  \hat{\Lambda} = \Big( \sum_n y_n \mu_n^\top \Big) \Big( N\Sigma + \sum_n \mu_n \mu_n^\top \Big)^{-1}

    \frac{\partial F}{\partial \Psi^{-1}} = \frac{N}{2} \Psi - \frac{1}{2} \sum_n \Big[ y_n y_n^\top - \Lambda \mu_n y_n^\top - y_n \mu_n^\top \Lambda^\top + \Lambda (\mu_n \mu_n^\top + \Sigma) \Lambda^\top \Big] = 0

    \hat{\Psi} = \frac{1}{N} \sum_n \Big[ y_n y_n^\top - \Lambda \mu_n y_n^\top - y_n \mu_n^\top \Lambda^\top + \Lambda (\mu_n \mu_n^\top + \Sigma) \Lambda^\top \Big]
               = \Lambda \Sigma \Lambda^\top + \frac{1}{N} \sum_n (y_n - \Lambda \mu_n)(y_n - \Lambda \mu_n)^\top  \qquad \text{(squared residuals)}

When \Sigma \to 0 these become the equations for linear regression!
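Putting the E step (slide 12) and these M-step updates together gives a complete EM loop for factor analysis. A minimal sketch (the synthetic data, initialization, and the choice to store diagonal \Psi as a vector are all arbitrary) that checks the likelihood never decreases:

```python
import numpy as np

rng = np.random.default_rng(2)
N, D, K = 500, 4, 2
# Synthetic data: y_n = Lambda x_n + noise (invented for this example).
Lam_true = rng.normal(size=(D, K))
Y = rng.normal(size=(N, K)) @ Lam_true.T + rng.normal(size=(N, D)) * 0.3

Lam = rng.normal(size=(D, K))          # arbitrary initialization
psi = np.ones(D)                       # diagonal of Psi, stored as a vector

def log_likelihood():
    C = Lam @ Lam.T + np.diag(psi)     # model covariance Lambda Lambda' + Psi
    _, logdet = np.linalg.slogdet(C)
    quad = np.einsum('nd,de,ne->', Y, np.linalg.inv(C), Y)
    return -0.5 * (N * (D * np.log(2 * np.pi) + logdet) + quad)

lls = []
for _ in range(30):
    # E step: posterior moments Sigma and mu_n for each data point
    iPsi = np.diag(1.0 / psi)
    Sigma = np.linalg.inv(np.eye(K) + Lam.T @ iPsi @ Lam)
    Mu = Y @ iPsi @ Lam @ Sigma        # row n is mu_n'
    # M step: the closed-form updates from this slide
    Exx = N * Sigma + Mu.T @ Mu        # sum_n (mu_n mu_n' + Sigma)
    Lam = (Y.T @ Mu) @ np.linalg.inv(Exx)
    psi = np.diag(Lam @ Sigma @ Lam.T
                  + (Y - Mu @ Lam.T).T @ (Y - Mu @ Lam.T) / N)
    lls.append(log_likelihood())

# EM monotonicity: the log likelihood never decreases.
assert all(b >= a - 1e-6 for a, b in zip(lls, lls[1:]))
```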
15 Mixtures of Factor Analysers

Simultaneous clustering and dimensionality reduction.

    P(y|\theta) = \sum_k \pi_k \, N(\mu_k, \Lambda_k \Lambda_k^\top + \Psi)

where \pi_k is the mixing proportion for factor analyser k, \mu_k is its centre, \Lambda_k is its factor loading matrix, and \Psi is a common sensor noise model. The parameters are \theta = \{\{\pi_k, \mu_k, \Lambda_k\}_{k=1 \ldots K}, \Psi\}.

We can think of this model as having two sets of hidden latent variables:

- a discrete indicator variable s_n \in \{1, \ldots, K\}
- for each factor analyser, a continuous factor vector x_{n,k} \in R^{D_k}

    P(y_n|\theta) = \sum_{s_n=1}^K P(s_n|\theta) \int P(x_n|s_n, \theta) \, P(y_n|x_n, s_n, \theta) \, dx_n

As before, an EM algorithm can be derived for this model:

E step: infer the joint distribution of the latent variables, P(x_n, s_n|y_n, \theta).
M step: maximize F with respect to \theta.
16 Proof of the Matrix Inversion Lemma

    (A + XBX^\top)^{-1} = A^{-1} - A^{-1} X (B^{-1} + X^\top A^{-1} X)^{-1} X^\top A^{-1}

Need to prove:

    \big( A^{-1} - A^{-1} X (B^{-1} + X^\top A^{-1} X)^{-1} X^\top A^{-1} \big) (A + XBX^\top) = I

Expand:

    I + A^{-1} X B X^\top - A^{-1} X (B^{-1} + X^\top A^{-1} X)^{-1} X^\top - A^{-1} X (B^{-1} + X^\top A^{-1} X)^{-1} X^\top A^{-1} X B X^\top

Regroup:

    = I + A^{-1} X \big( B X^\top - (B^{-1} + X^\top A^{-1} X)^{-1} X^\top - (B^{-1} + X^\top A^{-1} X)^{-1} X^\top A^{-1} X B X^\top \big)
    = I + A^{-1} X \big( B X^\top - (B^{-1} + X^\top A^{-1} X)^{-1} B^{-1} B X^\top - (B^{-1} + X^\top A^{-1} X)^{-1} X^\top A^{-1} X B X^\top \big)
    = I + A^{-1} X \big( B X^\top - (B^{-1} + X^\top A^{-1} X)^{-1} (B^{-1} + X^\top A^{-1} X) B X^\top \big)
    = I + A^{-1} X (B X^\top - B X^\top) = I
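The identity is easy to sanity-check numerically (a sketch with an arbitrary invertible diagonal A, identity B, and random X):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 6, 2
A = np.diag(rng.uniform(1.0, 3.0, n))    # invertible A (diagonal for simplicity)
B = np.eye(k)                            # invertible B
X = rng.normal(size=(n, k))

iA = np.linalg.inv(A)
# Left side: direct n x n inverse.
lhs = np.linalg.inv(A + X @ B @ X.T)
# Right side: the lemma, which only needs a k x k inverse.
rhs = iA - iA @ X @ np.linalg.inv(np.linalg.inv(B) + X.T @ iA @ X) @ X.T @ iA
assert np.allclose(lhs, rhs)
```

This is why the lemma matters computationally: when k is much smaller than n, the right-hand side replaces an n-by-n inversion with a k-by-k one.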
17 Readings

- MacKay, D. Textbook, chapter 21, draft 2.2.4, August 31, 2001.
- Ghahramani, Z. and Hinton, G.E. (1996) The EM Algorithm for Mixtures of Factor Analyzers. University of Toronto Technical Report CRG-TR-96-1. zoubin/papers/tr-96-1.ps.gz
- Minka, T. Tutorial on linear algebra. tpminka/papers/matrix.html
- Roweis, S.T. and Ghahramani, Z. (1999) A Unifying Review of Linear Gaussian Models. Neural Computation 11(2). See also Appendix A.1-A.2. zoubin/papers/lds.ps.gz
- Welling, M. (2000) Linear models. Class notes. zoubin/course01/pca.ps
More information1. Hydrogen Atom: 3p State
7633A QUANTUM MECHANICS I - solutio set - autum. Hydroge Atom: 3p State Let us assume that a hydroge atom is i a 3p state. Show that the radial part of its wave fuctio is r u 3(r) = 4 8 6 e r 3 r(6 r).
More informationEmpirical Processes: Glivenko Cantelli Theorems
Empirical Processes: Gliveko Catelli Theorems Mouliath Baerjee Jue 6, 200 Gliveko Catelli classes of fuctios The reader is referred to Chapter.6 of Weller s Torgo otes, Chapter??? of VDVW ad Chapter 8.3
More informationTopic 9: Sampling Distributions of Estimators
Topic 9: Samplig Distributios of Estimators Course 003, 2016 Page 0 Samplig distributios of estimators Sice our estimators are statistics (particular fuctios of radom variables), their distributio ca be
More informationThis exam contains 19 pages (including this cover page) and 10 questions. A Formulae sheet is provided with the exam.
Probability ad Statistics FS 07 Secod Sessio Exam 09.0.08 Time Limit: 80 Miutes Name: Studet ID: This exam cotais 9 pages (icludig this cover page) ad 0 questios. A Formulae sheet is provided with the
More informationLecture 1 Probability and Statistics
Wikipedia: Lecture 1 Probability ad Statistics Bejami Disraeli, British statesma ad literary figure (1804 1881): There are three kids of lies: lies, damed lies, ad statistics. popularized i US by Mark
More informationIntroduction to Optimization Techniques. How to Solve Equations
Itroductio to Optimizatio Techiques How to Solve Equatios Iterative Methods of Optimizatio Iterative methods of optimizatio Solutio of the oliear equatios resultig form a optimizatio problem is usually
More informationMathematical Statistics - MS
Paper Specific Istructios. The examiatio is of hours duratio. There are a total of 60 questios carryig 00 marks. The etire paper is divided ito three sectios, A, B ad C. All sectios are compulsory. Questios
More informationThe Bayesian Learning Framework. Back to Maximum Likelihood. Naïve Bayes. Simple Example: Coin Tosses. Given a generative model
Back to Maximum Likelihood Give a geerative model f (x, y = k) =π k f k (x) Usig a geerative modellig approach, we assume a parametric form for f k (x) =f (x; k ) ad compute the MLE θ of θ =(π k, k ) k=
More information
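The E and M steps described in the lecture can be sketched for the classic special case of a one-dimensional Gaussian mixture. This is a minimal illustrative implementation, not code from the lecture: the function name `em_gmm` and the NumPy-based layout are assumptions, and the E step uses the exact posterior q(H) = p(H | V, θ), so each iteration cannot decrease the log likelihood.

```python
import numpy as np

def em_gmm(x, K=2, iters=50, seed=0):
    """EM for a 1-D Gaussian mixture.

    E step: compute responsibilities r[n, k] = p(h_n = k | x_n, theta),
    i.e. set q(H) to the exact posterior, which makes the KL term zero.
    M step: maximise the expected complete-data log likelihood,
    which gives closed-form updates for pi, mu and var.
    """
    rng = np.random.default_rng(seed)
    n = len(x)
    pi = np.full(K, 1.0 / K)              # mixing proportions
    mu = rng.choice(x, K, replace=False)  # component means, init from data
    var = np.full(K, x.var())             # component variances
    lls = []
    for _ in range(iters):
        # E step: log p(x_n, h_n = k | theta) for every point and component
        logp = (np.log(pi)
                - 0.5 * np.log(2.0 * np.pi * var)
                - 0.5 * (x[:, None] - mu) ** 2 / var)
        ll = np.logaddexp.reduce(logp, axis=1)   # log p(x_n | theta)
        lls.append(ll.sum())                     # L(theta) at current params
        r = np.exp(logp - ll[:, None])           # posterior responsibilities
        # M step: weighted maximum-likelihood re-estimates
        Nk = r.sum(axis=0)
        pi = Nk / n
        mu = (r * x[:, None]).sum(axis=0) / Nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
    return pi, mu, var, np.array(lls)
```

Running this on data drawn from two well-separated Gaussians, the recorded log likelihoods form a non-decreasing sequence, which is exactly the guarantee proved on the last slide: L(θ^(k-1)) = F(q^(k), θ^(k-1)) ≤ F(q^(k), θ^(k)) ≤ L(θ^(k)).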