Unsupervised Learning 2001


1 Unsupervised Learning 2001
Lecture 3: The EM Algorithm
Zoubin Ghahramani
Carl Edward Rasmussen
Gatsby Computational Neuroscience Unit
MSc Intelligent Systems, Computer Science

2 The Expectation Maximization (EM) algorithm

Given a set of observed (visible) variables V, a set of unobserved (hidden) variables H, and model parameters θ, optimize the log likelihood:

$$ L(\theta) = \log p(V|\theta) = \log \int p(H, V|\theta)\, dH, \qquad (1) $$

where we have written the marginal for the visibles in terms of an integral over the joint distribution for hidden and visible variables. Using Jensen's inequality, for any distribution of hidden states q(H) we have:

$$ L(\theta) = \log \int q(H)\, \frac{p(H, V|\theta)}{q(H)}\, dH \;\geq\; \int q(H) \log \frac{p(H, V|\theta)}{q(H)}\, dH = F(q, \theta), \qquad (2) $$

defining the functional F(q, θ), which is a lower bound on the log likelihood. In the EM algorithm, we alternately optimize F(q, θ) wrt q and θ, and we can prove that this will never decrease L.
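As a quick numerical sanity check of the bound in (2), the sketch below (not part of the original notes) uses a toy two-component, one-dimensional Gaussian mixture with made-up parameter values and an arbitrary q over the hidden component; numpy is assumed throughout these sketches.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: hidden H in {0, 1} with prior pi_k, visible v | H=k ~ N(mu_k, 1).
pi = np.array([0.3, 0.7])
mu = np.array([-1.0, 2.0])
v = 0.5                                      # a single observed data point

log_joint = np.log(pi) - 0.5 * np.log(2 * np.pi) - 0.5 * (v - mu) ** 2  # log p(H=k, v | theta)
L = np.log(np.exp(log_joint).sum())          # log p(v | theta), eq. (1)

q = rng.dirichlet([1.0, 1.0])                # an arbitrary distribution over H
F = np.sum(q * (log_joint - np.log(q)))      # eq. (2): sum_H q(H) log[p(H, v | theta) / q(H)]

print(f"L = {L:.4f}   F = {F:.4f}   F <= L: {bool(F <= L + 1e-12)}")
```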

3 The E and M steps of EM

The lower bound on the log likelihood:

$$ F(q, \theta) = \int q(H) \log \frac{p(H, V|\theta)}{q(H)}\, dH = \int q(H) \log p(H, V|\theta)\, dH + H(q), \qquad (3) $$

where H(q) = -∫ q(H) log q(H) dH is the entropy of q. We iteratively alternate:

E step: optimize F(q, θ) wrt the distribution over hidden variables given the parameters:

$$ q^{(k)}(H) := \operatorname*{argmax}_{q(H)} F\big(q(H), \theta^{(k-1)}\big). \qquad (4) $$

M step: maximize F(q, θ) wrt the parameters given the hidden distribution:

$$ \theta^{(k)} := \operatorname*{argmax}_{\theta} F\big(q^{(k)}(H), \theta\big) = \operatorname*{argmax}_{\theta} \int q^{(k)}(H) \log p(H, V|\theta)\, dH, \qquad (5) $$

which is equivalent to optimizing the expected complete-data log likelihood, since the entropy of q(H) does not depend on θ.
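The M step in (5) can drop the entropy term because it is constant in θ. The sketch below (same toy mixture as above, hypothetical numbers) holds q fixed and varies θ, showing that F(q, θ) and the expected complete-data log likelihood always differ by exactly H(q).

```python
import numpy as np

# Toy mixture as above; hold q fixed and vary theta = (mu_1, mu_2).
pi = np.array([0.3, 0.7])
v = 0.5
q = np.array([0.4, 0.6])              # any fixed distribution over the hidden component
H_q = -np.sum(q * np.log(q))          # entropy H(q), which does not depend on theta

for mu in ([-1.0, 2.0], [0.0, 1.0], [3.0, -2.0]):
    mu = np.array(mu)
    log_joint = np.log(pi) - 0.5 * np.log(2 * np.pi) - 0.5 * (v - mu) ** 2
    expected_complete = np.sum(q * log_joint)     # int q(H) log p(H, V | theta) dH
    F = expected_complete + H_q                   # eq. (3)
    print(f"mu = {mu}   F = {F:.4f}   E_q[log p(H,V|theta)] = {expected_complete:.4f}   "
          f"difference = {F - expected_complete:.4f}")
```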

4 EM as Coordinate Ascent in F

5 The EM algorithm never decreases the log likelihood

The difference between the cost functions:

$$ L(\theta) - F(q, \theta) = \log p(V|\theta) - \int q(H) \log \frac{p(H, V|\theta)}{q(H)}\, dH $$
$$ = \log p(V|\theta) - \int q(H) \log \frac{p(H|V, \theta)\, p(V|\theta)}{q(H)}\, dH $$
$$ = -\int q(H) \log \frac{p(H|V, \theta)}{q(H)}\, dH = \mathrm{KL}\big(q(H), p(H|V, \theta)\big), \qquad (6) $$

is called the Kullback-Leibler divergence; it is non-negative and zero if and only if q(H) = p(H|V, θ) (thus this is what the E step achieves). Although we are working with the "wrong" cost function, the likelihood is still increased at every iteration:

$$ L\big(\theta^{(k-1)}\big) = F\big(q^{(k)}, \theta^{(k-1)}\big) \leq F\big(q^{(k)}, \theta^{(k)}\big) \leq L\big(\theta^{(k)}\big), \qquad (7) $$

where the first equality holds because of the E step, the first inequality comes from the M step and the final inequality from Jensen. Usually EM converges to a local optimum of L (although there are exceptions).
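A minimal sketch verifying identity (6) on the same toy mixture (numpy assumed, values made up): L(θ) − F(q, θ) matches the KL divergence between q and the exact posterior, and choosing q equal to the posterior makes the bound tight.

```python
import numpy as np

rng = np.random.default_rng(1)

pi = np.array([0.3, 0.7])
mu = np.array([-1.0, 2.0])
v = 0.5

log_joint = np.log(pi) - 0.5 * np.log(2 * np.pi) - 0.5 * (v - mu) ** 2  # log p(H=k, v)
L = np.log(np.exp(log_joint).sum())                                     # log p(v | theta)
posterior = np.exp(log_joint - L)                                       # p(H | v, theta)

q = rng.dirichlet([1.0, 1.0])                         # an arbitrary q(H)
F = np.sum(q * (log_joint - np.log(q)))
KL = np.sum(q * np.log(q / posterior))

print(f"L - F = {L - F:.6f}   KL(q, p(H|v)) = {KL:.6f}")      # identical, eq. (6)
F_at_posterior = np.sum(posterior * (log_joint - np.log(posterior)))
print(f"with q set to the posterior (E step): F = {F_at_posterior:.6f} = L = {L:.6f}")
```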

6 The KL(p(x), q(x)) is non-negative and zero iff p(x) = q(x) for all x

First let's consider discrete distributions; the Kullback-Leibler divergence is:

$$ \mathrm{KL}(p, q) = \sum_i q_i \log \frac{q_i}{p_i}. \qquad (8) $$

To find the distribution q which minimizes KL(p, q) we add a Lagrange multiplier to enforce the normalization:

$$ E = \mathrm{KL}(p, q) + \lambda\Big(1 - \sum_i q_i\Big) = \sum_i q_i \log \frac{q_i}{p_i} + \lambda\Big(1 - \sum_i q_i\Big). \qquad (9) $$

We then take partial derivatives and set them to zero:

$$ \frac{\partial E}{\partial q_i} = \log q_i - \log p_i + 1 - \lambda = 0 \;\Rightarrow\; q_i = p_i \exp(\lambda - 1), \qquad
\frac{\partial E}{\partial \lambda} = 1 - \sum_i q_i = 0 \;\Rightarrow\; \sum_i q_i = 1 \;\Rightarrow\; q_i = p_i. \qquad (10) $$
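A short numerical illustration of (8) with hypothetical five-state distributions (numpy assumed): the divergence is non-negative for arbitrary q and vanishes at q = p.

```python
import numpy as np

rng = np.random.default_rng(2)

def kl(p, q):
    # KL(p, q) = sum_i q_i log(q_i / p_i), the form used in eq. (8)
    return np.sum(q * np.log(q / p))

p = rng.dirichlet(np.ones(5))
for _ in range(4):
    q = rng.dirichlet(np.ones(5))
    print(f"KL = {kl(p, q):.4f}")      # always non-negative
print(f"KL at q = p: {kl(p, p):.4f}")  # zero at the minimum q = p
```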

7 Why KL(p, q) is ...

Check that the curvature (Hessian) is positive (definite), corresponding to a minimum:

$$ \frac{\partial^2 E}{\partial q_i\, \partial q_i} = \frac{1}{q_i} > 0, \qquad \frac{\partial^2 E}{\partial q_i\, \partial q_j} = 0 \;(i \neq j), \qquad (11) $$

showing that q_i = p_i is a genuine minimum. At the minimum it is easily verified that KL(p, p) = 0. A similar proof can be done for continuous distributions, with the partial derivatives substituted by functional derivatives.
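A finite-difference sketch of the curvature claim in (11), using made-up three-state distributions (numpy assumed, not part of the original notes): the numerical Hessian of E is approximately diag(1/q_i).

```python
import numpy as np

p = np.array([0.2, 0.3, 0.5])          # hypothetical target distribution
q0 = np.array([0.25, 0.35, 0.40])      # point at which to probe the curvature

def E(q):
    # the KL part of the Lagrangian; the multiplier term is linear, so it has no curvature
    return np.sum(q * np.log(q / p))

eps = 1e-4
hess = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        def shifted(si, sj):
            q = q0.copy()
            q[i] += si * eps
            q[j] += sj * eps
            return E(q)
        hess[i, j] = (shifted(+1, +1) - shifted(+1, -1)
                      - shifted(-1, +1) + shifted(-1, -1)) / (4 * eps ** 2)

print(np.round(hess, 3))               # approximately diag(1 / q_i): positive definite
print(np.round(1 / q0, 3))
```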

8 The Gaussian mixture model (E-step)

In the Gaussian mixture density model, the densities are given by:

$$ p(x|\theta) \propto \sum_{k=1}^{K} \frac{\pi_k}{\sigma_k} \exp\Big(-\frac{1}{2\sigma_k^2}(x - \mu_k)^2\Big), \qquad (12) $$

where θ is the collection of parameters: means μ_k, variances σ_k² and mixing proportions π_k (which must be positive and sum to one). There are (binary) hidden variables H_k^(c), indicating which component observation x^(c) belongs to. The conditional likelihood and priors are:

$$ p(x|H, \theta) = \prod_{k=1}^{K} \Big[\sigma_k^{-1} \exp\Big(-\frac{1}{2\sigma_k^2}(x - \mu_k)^2\Big)\Big]^{H_k}, \qquad p(H_k = 1|\theta) = \pi_k. \qquad (13) $$

In the E-step, compute the posterior for the hidden states given the current parameters:

$$ q(H) = p(H|x, \theta) \propto p(x|H, \theta)\, p(H|\theta), \qquad
q(H_k^{(c)}) \propto \frac{\pi_k}{\sigma_k} \exp\Big(-\frac{1}{2\sigma_k^2}(x^{(c)} - \mu_k)^2\Big), \qquad (14) $$

with the normalization being $q(H_k^{(c)}) \big/ \sum_{k'} q(H_{k'}^{(c)})$.
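A sketch of the E step in (14) for a one-dimensional mixture with made-up parameter values and observations (numpy assumed): compute the unnormalized responsibilities and normalize over the components.

```python
import numpy as np

# hypothetical current parameters of a K = 2 component, one-dimensional mixture
pi = np.array([0.4, 0.6])
mu = np.array([-2.0, 1.5])
sigma = np.array([1.0, 0.5])

x = np.array([-1.8, 0.2, 1.4, 3.0])            # observations x^(c)

# unnormalized q(H_k^(c)) proportional to (pi_k / sigma_k) exp(-(x^(c) - mu_k)^2 / (2 sigma_k^2))
unnorm = pi / sigma * np.exp(-(x[:, None] - mu) ** 2 / (2 * sigma ** 2))
resp = unnorm / unnorm.sum(axis=1, keepdims=True)   # normalize over the components k

print(np.round(resp, 3))                       # each row sums to one
```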

9 The Gaussian mixture model (M-step)

In the M-step we optimize the sum (since H is discrete):

$$ E = \sum_H q(H) \log\big[p(x|H, \theta)\, p(H|\theta)\big] = \sum_{c,k} q(H_k^{(c)}) \Big[\log \pi_k - \log \sigma_k - \frac{1}{2\sigma_k^2}(x^{(c)} - \mu_k)^2\Big]. \qquad (15) $$

Optimization wrt the parameters is done by setting the partial derivatives of E to zero:

$$ \frac{\partial E}{\partial \mu_k} = \sum_c q(H_k^{(c)})\, \frac{x^{(c)} - \mu_k}{\sigma_k^2} = 0 \;\Rightarrow\; \mu_k = \frac{\sum_c q(H_k^{(c)})\, x^{(c)}}{\sum_c q(H_k^{(c)})}, $$

$$ \frac{\partial E}{\partial \sigma_k} = \sum_c q(H_k^{(c)}) \Big[-\frac{1}{\sigma_k} + \frac{(x^{(c)} - \mu_k)^2}{\sigma_k^3}\Big] = 0 \;\Rightarrow\; \sigma_k^2 = \frac{\sum_c q(H_k^{(c)})\, (x^{(c)} - \mu_k)^2}{\sum_c q(H_k^{(c)})}, $$

$$ \frac{\partial E}{\partial \pi_k} = \frac{1}{\pi_k} \sum_c q(H_k^{(c)}), \qquad \frac{\partial E}{\partial \pi_k} - \lambda = 0 \;\Rightarrow\; \pi_k = \frac{1}{n} \sum_c q(H_k^{(c)}), \qquad (16) $$

where λ is a Lagrange multiplier ensuring that the mixing proportions sum to unity and n is the number of observations.
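Putting the E step above together with the M-step updates in (16) gives a complete EM loop for the one-dimensional mixture. A minimal sketch on synthetic data (numpy assumed, all numbers made up), monitoring the log likelihood to confirm it never decreases:

```python
import numpy as np

rng = np.random.default_rng(3)
# synthetic one-dimensional data drawn from two Gaussians
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(1.5, 0.5, 300)])
n, K = len(x), 2

# crude initial parameter guesses
pi, mu, sigma = np.full(K, 1.0 / K), np.array([-1.0, 1.0]), np.ones(K)

for it in range(50):
    # E step (eq. 14): responsibilities q(H_k^(c))
    unnorm = pi / sigma * np.exp(-(x[:, None] - mu) ** 2 / (2 * sigma ** 2))
    loglik = np.sum(np.log(unnorm.sum(axis=1) / np.sqrt(2 * np.pi)))
    resp = unnorm / unnorm.sum(axis=1, keepdims=True)

    # M step (eq. 16)
    Nk = resp.sum(axis=0)
    mu = (resp * x[:, None]).sum(axis=0) / Nk
    sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / Nk)
    pi = Nk / n

    if it % 10 == 0:
        print(f"iteration {it:2d}   log likelihood {loglik:8.2f}")

print("mu:", np.round(mu, 2), " sigma:", np.round(sigma, 2), " pi:", np.round(pi, 2))
```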

10 Factor Analysis

[Graphical model: latent factors X_1, ..., X_K connected through the loading matrix Λ to observations Y_1, Y_2, ..., Y_D.]

Linear generative model:

$$ y_d = \sum_{k=1}^{K} \Lambda_{dk}\, x_k + \epsilon_d $$

- x_k are independent N(0, 1) Gaussian factors
- ε_d are independent N(0, Ψ_dd) Gaussian noise
- K < D

So, y is Gaussian with:

$$ P(y) = \int P(x)\, P(y|x)\, dx = \mathcal{N}(0, \Lambda\Lambda^\top + \Psi), $$

where Λ is a D × K matrix, and Ψ is diagonal.

Dimensionality reduction: finds a low-dimensional projection of high-dimensional data that captures the correlation structure of the data.
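A sketch of the generative model with hypothetical dimensions (D = 5 observed, K = 2 factors; numpy assumed): sample y = Λx + ε and check that the sample covariance approaches ΛΛᵀ + Ψ.

```python
import numpy as np

rng = np.random.default_rng(4)
D, K, N = 5, 2, 100_000                       # hypothetical dimensions and sample size

Lam = rng.normal(size=(D, K))                 # factor loading matrix Lambda (D x K)
Psi = np.diag(rng.uniform(0.1, 0.5, size=D))  # diagonal sensor noise covariance

x = rng.normal(size=(N, K))                   # independent N(0, 1) factors
eps = rng.normal(size=(N, D)) * np.sqrt(np.diag(Psi))
y = x @ Lam.T + eps                           # y_d = sum_k Lambda_dk x_k + eps_d

print(np.round(np.cov(y.T), 2))               # empirical covariance of y
print(np.round(Lam @ Lam.T + Psi, 2))         # model covariance Lambda Lambda' + Psi
```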

11 EM for Factor Analysis

[Same graphical model as above: factors X_1, ..., X_K, loadings Λ, observations Y_1, Y_2, ..., Y_D.]

The model for y:

$$ P(y|\theta) = \int P(x|\theta)\, P(y|x, \theta)\, dx = \mathcal{N}(0, \Lambda\Lambda^\top + \Psi) $$

Model parameters: θ = {Λ, Ψ}.

E step: For each data point y_n, compute the posterior distribution of hidden factors given the observed data: Q_n(x) = P(x|y_n, θ_t).

M step: Find the θ_{t+1} that maximises F(Q, θ):

$$ F(Q, \theta) = \sum_n \int Q_n(x)\big[\log P(x|\theta) + \log P(y_n|x, \theta) - \log Q_n(x)\big]\, dx
 = \sum_n \int Q_n(x)\big[\log P(x|\theta) + \log P(y_n|x, \theta)\big]\, dx + \text{const}. $$

12 The E step for Factor Analysis

E step: For each data point y_n, compute the posterior distribution of hidden factors given the observed data:

$$ Q_n(x) = P(x|y_n, \theta_t) = P(x, y_n|\theta)\,/\,P(y_n|\theta) $$

Tactic: write P(x, y_n|θ), consider y_n to be fixed. What is this as a function of x?

$$ P(x, y_n) = P(x)\, P(y_n|x)
 = (2\pi)^{-K/2} \exp\{-\tfrac{1}{2} x^\top x\}\; |2\pi\Psi|^{-1/2} \exp\{-\tfrac{1}{2}(y_n - \Lambda x)^\top \Psi^{-1}(y_n - \Lambda x)\} $$
$$ = \text{const} \times \exp\{-\tfrac{1}{2}\big[x^\top x + (y_n - \Lambda x)^\top \Psi^{-1}(y_n - \Lambda x)\big]\} $$
$$ = \text{const} \times \exp\{-\tfrac{1}{2}\big[x^\top (I + \Lambda^\top \Psi^{-1}\Lambda)\, x - 2 x^\top \Lambda^\top \Psi^{-1} y_n\big]\} $$
$$ = \text{const} \times \exp\{-\tfrac{1}{2}\big[x^\top \Sigma^{-1} x - 2 x^\top \Sigma^{-1}\mu_n + \mu_n^\top \Sigma^{-1}\mu_n\big]\} $$

So Σ = (I + ΛᵀΨ⁻¹Λ)⁻¹ = I − βΛ and μ_n = ΣΛᵀΨ⁻¹y_n = βy_n, where β ≡ ΣΛᵀΨ⁻¹. Note that μ_n is a linear function of y_n and Σ does not depend on y_n.
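A sketch of these E-step formulas with hypothetical Λ, Ψ and a single data point (numpy assumed), cross-checked against the standard Gaussian conditioning formulas (the two agree by the matrix inversion lemma of slide 16):

```python
import numpy as np

rng = np.random.default_rng(5)
D, K = 5, 2
Lam = rng.normal(size=(D, K))                 # hypothetical loadings Lambda
Psi_diag = rng.uniform(0.1, 0.5, size=D)      # diagonal of Psi
y_n = rng.normal(size=D)                      # one observed data point

Psi_inv = np.diag(1.0 / Psi_diag)
Sigma = np.linalg.inv(np.eye(K) + Lam.T @ Psi_inv @ Lam)   # posterior covariance (same for every n)
beta = Sigma @ Lam.T @ Psi_inv                              # mu_n = beta y_n
mu_n = beta @ y_n                                           # posterior mean, linear in y_n

# cross-check against conditioning the joint Gaussian with marginal N(0, Lambda Lambda' + Psi)
C = Lam @ Lam.T + np.diag(Psi_diag)
mu_check = Lam.T @ np.linalg.solve(C, y_n)
Sigma_check = np.eye(K) - Lam.T @ np.linalg.solve(C, Lam)

print("max |mu_n - mu_check|     =", np.max(np.abs(mu_n - mu_check)))
print("max |Sigma - Sigma_check| =", np.max(np.abs(Sigma - Sigma_check)))
```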

13 The M step for Factor Analysis

M step: Find θ_{t+1} maximising

$$ F = \sum_n \int Q_n(x)\big[\log P(x|\theta) + \log P(y_n|x, \theta)\big]\, dx + \text{const}. $$

$$ \log P(x|\theta) + \log P(y_n|x, \theta) = \text{const} - \tfrac{1}{2} x^\top x - \tfrac{1}{2}\log|\Psi| - \tfrac{1}{2}(y_n - \Lambda x)^\top \Psi^{-1}(y_n - \Lambda x) $$
$$ = \text{const} - \tfrac{1}{2}\log|\Psi| - \tfrac{1}{2}\big[y_n^\top \Psi^{-1} y_n - 2\, y_n^\top \Psi^{-1}\Lambda x + x^\top \Lambda^\top \Psi^{-1}\Lambda x\big] $$
$$ = \text{const} - \tfrac{1}{2}\log|\Psi| - \tfrac{1}{2}\big[y_n^\top \Psi^{-1} y_n - 2\, y_n^\top \Psi^{-1}\Lambda x + \mathrm{tr}(\Lambda^\top \Psi^{-1}\Lambda\, x x^\top)\big] $$

Taking expectations over Q_n(x)...

$$ = \text{const} - \tfrac{1}{2}\log|\Psi| - \tfrac{1}{2}\big[y_n^\top \Psi^{-1} y_n - 2\, y_n^\top \Psi^{-1}\Lambda \mu_n + \mathrm{tr}\big(\Lambda^\top \Psi^{-1}\Lambda\,(\mu_n \mu_n^\top + \Sigma)\big)\big] $$

14 The M step for Factor Analysis (cont.)

$$ F = \text{const} - \frac{N}{2}\log|\Psi| - \frac{1}{2}\sum_n \big[y_n^\top \Psi^{-1} y_n - 2\, y_n^\top \Psi^{-1}\Lambda \mu_n + \mathrm{tr}\big(\Lambda^\top \Psi^{-1}\Lambda\,(\mu_n\mu_n^\top + \Sigma)\big)\big] $$

Taking derivatives with respect to Λ and Ψ⁻¹, using ∂tr(AB)/∂B = Aᵀ and ∂log|A|/∂A = A⁻ᵀ:

$$ \frac{\partial F}{\partial \Lambda} = \Psi^{-1}\sum_n y_n \mu_n^\top - \Psi^{-1}\Lambda\Big(N\Sigma + \sum_n \mu_n\mu_n^\top\Big) = 0
 \;\Rightarrow\; \hat{\Lambda} = \Big(\sum_n y_n\mu_n^\top\Big)\Big(N\Sigma + \sum_n \mu_n\mu_n^\top\Big)^{-1} $$

$$ \frac{\partial F}{\partial \Psi^{-1}} = \frac{N}{2}\Psi - \frac{1}{2}\sum_n\big[y_n y_n^\top - \Lambda\mu_n y_n^\top - y_n\mu_n^\top \Lambda^\top + \Lambda(\mu_n\mu_n^\top + \Sigma)\Lambda^\top\big] $$

$$ \hat{\Psi} = \frac{1}{N}\sum_n\big[y_n y_n^\top - \Lambda\mu_n y_n^\top - y_n\mu_n^\top \Lambda^\top + \Lambda(\mu_n\mu_n^\top + \Sigma)\Lambda^\top\big]
 = \Lambda\Sigma\Lambda^\top + \frac{1}{N}\sum_n (y_n - \Lambda\mu_n)(y_n - \Lambda\mu_n)^\top \quad \text{(squared residuals)} $$

When Σ → 0 these become the equations for linear regression!
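Combining the E step of slide 12 with the Λ̂ and Ψ̂ updates above gives a complete EM iteration for factor analysis. A minimal sketch on synthetic data (numpy assumed, hypothetical dimensions); the Ψ update uses the simplified form that holds when Λ is the freshly updated value:

```python
import numpy as np

rng = np.random.default_rng(6)
D, K, N = 8, 2, 2000                                   # hypothetical dimensions

# synthetic data from a "true" factor analysis model
Lam_true = rng.normal(size=(D, K))
Psi_true = rng.uniform(0.1, 0.5, size=D)
Y = rng.normal(size=(N, K)) @ Lam_true.T + rng.normal(size=(N, D)) * np.sqrt(Psi_true)

Lam = rng.normal(size=(D, K))                          # initial guesses
Psi = np.ones(D)                                       # stored as the diagonal of Psi
S = Y.T @ Y / N                                        # data second-moment matrix

for it in range(100):
    # E step: Sigma = (I + Lam' Psi^-1 Lam)^-1, mu_n = Sigma Lam' Psi^-1 y_n
    PsiInvLam = Lam / Psi[:, None]
    Sigma = np.linalg.inv(np.eye(K) + Lam.T @ PsiInvLam)
    Mu = Y @ PsiInvLam @ Sigma                         # N x K matrix whose rows are mu_n'

    # M step: Lam_hat = (sum_n y_n mu_n')(N Sigma + sum_n mu_n mu_n')^-1
    YM = Y.T @ Mu                                      # sum_n y_n mu_n'
    Lam = YM @ np.linalg.inv(N * Sigma + Mu.T @ Mu)
    # Psi_hat = diag{ (1/N) sum_n y_n y_n' - Lam_hat (1/N) sum_n mu_n y_n' }
    Psi = np.diag(S - Lam @ YM.T / N)

    if it % 20 == 0:
        C = Lam @ Lam.T + np.diag(Psi)                 # model covariance
        loglik = -0.5 * N * (D * np.log(2 * np.pi)
                             + np.linalg.slogdet(C)[1]
                             + np.trace(np.linalg.solve(C, S)))
        print(f"iteration {it:3d}   log likelihood {loglik:10.1f}")
```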

15 Mixtures of Factor Analysers

Simultaneous clustering and dimensionality reduction.

$$ P(y|\theta) = \sum_k \pi_k\, \mathcal{N}(\mu_k, \Lambda_k\Lambda_k^\top + \Psi) $$

where π_k is the mixing proportion for FA k, μ_k is its centre, Λ_k is its factor loading matrix, and Ψ is a common sensor noise model. θ = {{π_k, μ_k, Λ_k}_{k=1...K}, Ψ}

We can think of this model as having two sets of hidden latent variables:
- a discrete indicator variable s_n ∈ {1, ..., K}
- for each factor analyzer, a continuous factor vector x_{n,k} ∈ R^{D_k}

$$ P(y_n|\theta) = \sum_{s_n=1}^{K} P(s_n|\theta) \int P(x|s_n, \theta)\, P(y_n|x, s_n, \theta)\, dx $$

As before, an EM algorithm can be derived for this model:
E step: Infer the joint distribution of the latent variables, P(x, s_n|y_n, θ)
M step: Maximize F with respect to θ.
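A sketch evaluating the mixture-of-factor-analysers density for made-up parameters (numpy assumed): each component contributes π_k N(y; μ_k, Λ_kΛ_kᵀ + Ψ), and normalizing the per-component terms gives the posterior over the discrete indicator, which is part of the E step.

```python
import numpy as np

rng = np.random.default_rng(7)
D, K_fact, K_comp = 4, 2, 3                            # hypothetical sizes

# made-up MFA parameters
pis = rng.dirichlet(np.ones(K_comp))
mus = rng.normal(size=(K_comp, D))
Lams = rng.normal(size=(K_comp, D, K_fact))
Psi = np.diag(rng.uniform(0.1, 0.5, size=D))           # shared sensor noise

def log_gauss(y, m, C):
    d = y - m
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (len(y) * np.log(2 * np.pi) + logdet + d @ np.linalg.solve(C, d))

y = rng.normal(size=D)                                 # a query point
log_terms = np.array([np.log(pis[k]) + log_gauss(y, mus[k], Lams[k] @ Lams[k].T + Psi)
                      for k in range(K_comp)])
log_py = np.logaddexp.reduce(log_terms)                # log p(y | theta)

print(f"log p(y) = {log_py:.4f}")
print("posterior over the indicator s:", np.round(np.exp(log_terms - log_py), 3))
```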

16 Proof of the Matrix Inversion Lemma

$$ (A + XBX^\top)^{-1} = A^{-1} - A^{-1}X(B^{-1} + X^\top A^{-1}X)^{-1}X^\top A^{-1} $$

Need to prove:

$$ \big(A^{-1} - A^{-1}X(B^{-1} + X^\top A^{-1}X)^{-1}X^\top A^{-1}\big)(A + XBX^\top) = I $$

Expand:

$$ I + A^{-1}XBX^\top - A^{-1}X(B^{-1} + X^\top A^{-1}X)^{-1}X^\top - A^{-1}X(B^{-1} + X^\top A^{-1}X)^{-1}X^\top A^{-1}XBX^\top $$

Regroup:

$$ = I + A^{-1}X\big(BX^\top - (B^{-1} + X^\top A^{-1}X)^{-1}X^\top - (B^{-1} + X^\top A^{-1}X)^{-1}X^\top A^{-1}XBX^\top\big) $$
$$ = I + A^{-1}X\big(BX^\top - (B^{-1} + X^\top A^{-1}X)^{-1}B^{-1}BX^\top - (B^{-1} + X^\top A^{-1}X)^{-1}X^\top A^{-1}XBX^\top\big) $$
$$ = I + A^{-1}X\big(BX^\top - (B^{-1} + X^\top A^{-1}X)^{-1}(B^{-1} + X^\top A^{-1}X)BX^\top\big) $$
$$ = I + A^{-1}X(BX^\top - BX^\top) = I $$
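A quick numerical confirmation of the lemma with random matrices of hypothetical sizes (numpy assumed, A and B made symmetric positive definite so both inverses exist):

```python
import numpy as np

rng = np.random.default_rng(8)
n, k = 6, 2
A = rng.normal(size=(n, n)); A = A @ A.T + n * np.eye(n)   # symmetric positive definite
B = rng.normal(size=(k, k)); B = B @ B.T + k * np.eye(k)
X = rng.normal(size=(n, k))

Ainv, Binv = np.linalg.inv(A), np.linalg.inv(B)
lhs = np.linalg.inv(A + X @ B @ X.T)
rhs = Ainv - Ainv @ X @ np.linalg.inv(Binv + X.T @ Ainv @ X) @ X.T @ Ainv

print("max absolute difference:", np.max(np.abs(lhs - rhs)))    # ~1e-15
```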

17 Readings

David MacKay's textbook, chapter 21, draft 2.2.4, August 31, 2001.

Ghahramani, Z. and Hinton, G.E. (1996) The EM Algorithm for Mixtures of Factor Analyzers. University of Toronto Technical Report CRG-TR-96-1. zoubin/papers/tr-96-1.ps.gz

Minka, T. Tutorial on linear algebra. tpminka/papers/matrix.html

Roweis, S.T. and Ghahramani, Z. (1999) A Unifying Review of Linear Gaussian Models. Neural Computation 11(2). See also Appendix A.1-A.2. zoubin/papers/lds.ps.gz

Welling, M. (2000) Linear models. Class notes. zoubin/course01/pca.ps
