Course 495: Advanced Statistical Machine Learning/Pattern Recognition


Course 495: Advanced Statistical Machine Learning/Pattern Recognition
Goal (Lecture): To present Probabilistic Principal Component Analysis (PPCA) using both Maximum Likelihood (ML) and Expectation Maximization (EM).
Goal (Tutorials): To provide the students with the necessary mathematical tools for deeply understanding PPCA.

Materials
Pattern Recognition & Machine Learning by C. Bishop, Chapter 12.
PPCA: Tipping, Michael E., and Christopher M. Bishop. "Probabilistic principal component analysis." Journal of the Royal Statistical Society: Series B (Statistical Methodology) 61.3 (1999): 611-622.
Mixtures of PPCA: Tipping, Michael E., and Christopher M. Bishop. "Mixtures of probabilistic principal component analyzers." Neural Computation 11.2 (1999): 443-482.

An overview
PCA: Maximize the global variance.
LDA: Minimize the within-class variance while maximizing the between-class variance.
LPP: Minimize the local variance.
ICA: Maximize independence by maximizing non-Gaussianity.
All are deterministic!

Advantages of PPCA over PCA
An EM algorithm for PCA that is computationally efficient in situations where only a few leading eigenvectors are required, and that avoids having to evaluate the data covariance matrix as an intermediate step.
A combination of a probabilistic model and EM allows us to deal with missing values in the dataset.
(Figure: data matrices with missing values.)

Advantages of PPCA over PCA
Mixtures of probabilistic PCA models can be formulated in a principled way and trained using the EM algorithm.
(Figure: mixture components with parameters $W_1$, $W_5$, $W_9$.)

Advantages of PPCA over PCA
Probabilistic PCA forms the basis for a Bayesian treatment of PCA in which the dimensionality of the principal subspace can be found automatically from the data.
Priors on W: Automatic Relevance Determination (find the relevant components), with $w_i$ the $i$-th column of $W$:
$p(W \mid \alpha) = \prod_{i=1}^{d} \left(\frac{\alpha_i}{2\pi}\right)^{F/2} \exp\left\{-\frac{1}{2}\alpha_i\, w_i^T w_i\right\}$

Advantages of PPCA over PCA
Probabilistic PCA can be used to model class-conditional densities and hence be applied to classification problems.
(Figure: class-conditional densities.)
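
As a concrete illustration (not from the slides), here is a minimal classification sketch in NumPy/SciPy. It assumes the per-class PPCA parameters $(W_c, \mu_c, \sigma_c^2)$ have already been estimated, and it uses the PPCA marginal $x \sim \mathcal{N}(\mu_c, W_c W_c^T + \sigma_c^2 I)$ that is derived later in this lecture; all names and values below are hypothetical.

```python
import numpy as np
from scipy.stats import multivariate_normal

def ppca_class_scores(x, class_params, log_priors):
    """Unnormalised log-posteriors log p(x | c) + log p(c), one per class."""
    scores = []
    for (W, mu, sigma2), log_prior in zip(class_params, log_priors):
        D = W @ W.T + sigma2 * np.eye(W.shape[0])   # PPCA marginal covariance of class c
        scores.append(multivariate_normal.logpdf(x, mean=mu, cov=D) + log_prior)
    return np.array(scores)

# Hypothetical usage: two classes, F = 5 observed dimensions, d = 2 latent dimensions
rng = np.random.default_rng(0)
class_params = [(rng.standard_normal((5, 2)), rng.standard_normal(5), 0.1) for _ in range(2)]
x = rng.standard_normal(5)
print(np.argmax(ppca_class_scores(x, class_params, np.log([0.5, 0.5]))))  # predicted class
```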

Advantages of PPCA over PCA
The probabilistic PCA model can be run generatively to provide samples from the distribution.

Probabilistic Principal Component Analysis
General concept: latent variables $y_1, y_2, y_3, \dots, y_N$ share a common linear structure that generates the observations $x_1, x_2, x_3, \dots, x_N$. We want to find the parameters of this linear mapping.

Probabilistic Principal Component Analysis
Graphical model: for each $i = 1, \dots, N$ the latent variable $y_i$ generates the observation $x_i$, with the parameters $W$, $\mu$ and $\sigma^2$ shared across the plate.
(Figure: plate diagram, together with sketches of the prior $p(y)$, the conditional $p(x \mid y)$ and the marginal $p(x)$.)
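
To make the generative process concrete, here is a minimal NumPy sketch (not from the slides) that samples from the model $y_i \sim \mathcal{N}(0, I_d)$, $x_i = W y_i + \mu + \varepsilon_i$, $\varepsilon_i \sim \mathcal{N}(0, \sigma^2 I_F)$; the dimensions and parameter values are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(0)
N, F, d = 500, 10, 2             # number of samples, data dim, latent dim (arbitrary)
W = rng.standard_normal((F, d))  # linear mapping (ground truth for this toy example)
mu = rng.standard_normal(F)      # data mean
sigma2 = 0.1                     # isotropic noise variance

Y = rng.standard_normal((N, d))                                    # y_i ~ N(0, I_d)
X = Y @ W.T + mu + np.sqrt(sigma2) * rng.standard_normal((N, F))   # x_i = W y_i + mu + noise

# The sample covariance should be close to W W^T + sigma^2 I for large N
print(np.abs(np.cov(X.T, bias=True) - (W @ W.T + sigma2 * np.eye(F))).max())
```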

ML-PPCA
$p(x_1, \dots, x_N, y_1, \dots, y_N \mid \theta) = \prod_{i=1}^{N} p(x_i \mid y_i, W, \mu, \sigma)\, p(y_i)$
$p(x_i \mid y_i, W, \mu, \sigma) = \mathcal{N}(x_i \mid W y_i + \mu,\ \sigma^2 I)$
$p(y_i) = \mathcal{N}(y_i \mid 0, I)$
$p(x_1, \dots, x_N \mid \theta) = \int p(x_1, \dots, x_N, y_1, \dots, y_N \mid \theta)\, dy_1 \cdots dy_N = \int \prod_{i=1}^{N} p(x_i \mid y_i, W, \mu, \sigma)\, p(y_i)\, dy_1 \cdots dy_N$

ML-PPCA
$p(x_1, \dots, x_N \mid \theta) = \int \prod_{i=1}^{N} p(x_i \mid y_i, W, \mu, \sigma)\, p(y_i)\, dy_1 \cdots dy_N = \prod_{i=1}^{N} \int_{y_i} p(x_i \mid y_i, W, \mu, \sigma)\, p(y_i)\, dy_i$
$p(x_i \mid y_i, W, \mu, \sigma)\, p(y_i) = \frac{1}{(2\pi)^{F/2}\, \sigma^{F}}\, e^{-\frac{1}{2\sigma^2}(x_i - \mu - W y_i)^T (x_i - \mu - W y_i)} \cdot \frac{1}{(2\pi)^{d/2}}\, e^{-\frac{1}{2} y_i^T y_i}$

ML-PPCA
Complete the square in the exponent (up to the common factor $-\frac{1}{2\sigma^2}$):
$(x_i - \mu - W y_i)^T (x_i - \mu - W y_i) + \sigma^2 y_i^T y_i$
Writing $x_i$ for the centered data $x_i - \mu$ from here on:
$= (x_i - W y_i)^T (x_i - W y_i) + \sigma^2 y_i^T y_i$
$= x_i^T x_i - 2 x_i^T W y_i + y_i^T W^T W y_i + \sigma^2 y_i^T y_i$
$= x_i^T x_i - 2 x_i^T W y_i + y_i^T (\sigma^2 I + W^T W) y_i$, with $M \equiv \sigma^2 I + W^T W$.

ML-PPCA
$= x_i^T x_i - 2 x_i^T W y_i + y_i^T M y_i$
$= x_i^T x_i - 2 (M^{-1} W^T x_i)^T M y_i + y_i^T M y_i + (M^{-1} W^T x_i)^T M (M^{-1} W^T x_i) - (M^{-1} W^T x_i)^T M M^{-1} W^T x_i$
$= x_i^T (I - W M^{-1} W^T) x_i + (y_i - M^{-1} W^T x_i)^T M (y_i - M^{-1} W^T x_i)$

ML-PPCA
$\int_{y_i} p(x_i \mid y_i, W, \mu, \sigma)\, p(y_i)\, dy_i \propto e^{-\frac{1}{2\sigma^2} x_i^T (I - W M^{-1} W^T) x_i} \int_{y_i} e^{-\frac{1}{2\sigma^2}(y_i - M^{-1} W^T x_i)^T M (y_i - M^{-1} W^T x_i)}\, dy_i$
$= \mathcal{N}\left(x_i \mid \mu,\ \left(\sigma^{-2} I - \sigma^{-2} W M^{-1} W^T\right)^{-1}\right)$

ML-PPCA
Let's have a look at:
$\sigma^{-2} I - \sigma^{-2} W M^{-1} W^T = \sigma^{-2} I - \sigma^{-2} W (\sigma^2 I + W^T W)^{-1} W^T$
Woodbury formula: $(A + UCV)^{-1} = A^{-1} - A^{-1} U \left(C^{-1} + V A^{-1} U\right)^{-1} V A^{-1}$
Can you prove that if $D = W W^T + \sigma^2 I$, then $D^{-1} = \sigma^{-2} I - \sigma^{-2} W (\sigma^2 I + W^T W)^{-1} W^T$?
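
As a quick numerical sanity check of the claimed identity (a sketch, with arbitrary random $W$ and $\sigma^2$):

```python
import numpy as np

rng = np.random.default_rng(1)
F, d, sigma2 = 6, 2, 0.3
W = rng.standard_normal((F, d))

D = W @ W.T + sigma2 * np.eye(F)                      # D = W W^T + sigma^2 I
M = sigma2 * np.eye(d) + W.T @ W                      # M = sigma^2 I + W^T W
D_inv_woodbury = (np.eye(F) - W @ np.linalg.solve(M, W.T)) / sigma2

print(np.allclose(np.linalg.inv(D), D_inv_woodbury))  # True: D^{-1} = sigma^{-2}(I - W M^{-1} W^T)
```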

ML-PPCA
$p(x_i \mid W, \mu, \sigma^2) = \int_{y_i} p(x_i \mid y_i, W, \mu, \sigma)\, p(y_i)\, dy_i = \mathcal{N}(x_i \mid \mu, D)$, with $D = W W^T + \sigma^2 I$.
Using the above we can also easily show that the posterior is
$p(y_i \mid x_i, W, \mu, \sigma^2) = \frac{p(x_i \mid y_i, W, \mu, \sigma^2)\, p(y_i)}{p(x_i \mid W, \mu, \sigma^2)} = \mathcal{N}\left(y_i \mid M^{-1} W^T (x_i - \mu),\ \sigma^2 M^{-1}\right)$

ML-PPCA
$p(x_1, \dots, x_N \mid \theta) = \prod_{i=1}^{N} \mathcal{N}(x_i \mid \mu, D)$
$\theta = \arg\max_\theta \ln p(x_1, \dots, x_N \mid \theta) = \arg\max_\theta \sum_{i=1}^{N} \ln \mathcal{N}(x_i \mid \mu, D)$
$\ln p(x_1, \dots, x_N \mid \theta) = -\frac{NF}{2}\ln 2\pi - \frac{N}{2}\ln|D| - \frac{1}{2}\sum_{i=1}^{N}(x_i - \mu)^T D^{-1} (x_i - \mu)$
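
For reference, a small sketch (illustrative, not from the slides) that evaluates this log-likelihood directly in NumPy; `X` is an $N \times F$ data matrix and the parameter values are arbitrary.

```python
import numpy as np

def ppca_log_likelihood(X, W, mu, sigma2):
    """ln p(x_1..x_N | theta) = sum_i ln N(x_i | mu, D), with D = W W^T + sigma^2 I."""
    N, F = X.shape
    D = W @ W.T + sigma2 * np.eye(F)
    Xc = X - mu
    _, logdet = np.linalg.slogdet(D)
    quad = np.einsum('ij,ij->', Xc @ np.linalg.inv(D), Xc)   # sum_i (x_i-mu)^T D^{-1} (x_i-mu)
    return -0.5 * (N * F * np.log(2 * np.pi) + N * logdet + quad)

# Hypothetical usage
rng = np.random.default_rng(2)
X = rng.standard_normal((100, 5))
print(ppca_log_likelihood(X, rng.standard_normal((5, 2)), X.mean(axis=0), 0.5))
```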

ML-PPCA
$L(W, \sigma^2, \mu) = -\frac{NF}{2}\ln 2\pi - \frac{N}{2}\ln|D| - \frac{N}{2}\,\mathrm{tr}[D^{-1} S_t]$, where $S_t = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)(x_i - \mu)^T$
$\frac{dL}{d\mu} = 0 \Rightarrow \mu = \frac{1}{N}\sum_{i=1}^{N} x_i$

ML-PPCA dl dw = N D 1 S t D 1 W D 1 W S t D 1 W = W Three different solutions: (1) W = 0 which is the minimum of the likelihood (2) D = S t WW T + σ 2 Ι = S t Assume the eigen-decomposition S t = UΛU T U square matrix of eigenvectors R is a rotation matrix R T R = I W = U(Λ σ 2 Ι) 1/2 R 20

ML-PPCA
(3) $D \neq S_t$ and $W \neq 0$, with $d < q = \mathrm{rank}(S_t)$.
Assume the SVD of $W$: $W = U L V^T$, where $U = [u_1 \cdots u_d]$ is an $F \times d$ matrix with $U^T U = I$, $V^T V = V V^T = I$, and $L = \mathrm{diag}(l_1, \dots, l_d)$.
Then $S_t D^{-1} U L V^T = U L V^T$. Let's study $D^{-1} U$.

ML-PPCA
$D^{-1} = (W W^T + \sigma^2 I)^{-1} \stackrel{W = U L V^T}{=} (U L^2 U^T + \sigma^2 I)^{-1}$
Assume a set of bases $U_{F-d}$ such that $U_{F-d}^T U = 0$ and $U_{F-d}^T U_{F-d} = I$. Then
$D^{-1} = \left([U\ U_{F-d}]\begin{bmatrix} L^2 & 0 \\ 0 & 0 \end{bmatrix}[U\ U_{F-d}]^T + [U\ U_{F-d}]\,\sigma^2 I\,[U\ U_{F-d}]^T\right)^{-1}$
$= \left([U\ U_{F-d}]\begin{bmatrix} L^2 + \sigma^2 I & 0 \\ 0 & \sigma^2 I \end{bmatrix}[U\ U_{F-d}]^T\right)^{-1}$
$= [U\ U_{F-d}]\begin{bmatrix} (L^2 + \sigma^2 I)^{-1} & 0 \\ 0 & \sigma^{-2} I \end{bmatrix}[U\ U_{F-d}]^T$

ML-PPCA
$D^{-1} U = [U\ U_{F-d}]\begin{bmatrix} (L^2 + \sigma^2 I)^{-1} & 0 \\ 0 & \sigma^{-2} I \end{bmatrix}[U\ U_{F-d}]^T U = [U\ U_{F-d}]\begin{bmatrix} (L^2 + \sigma^2 I)^{-1} & 0 \\ 0 & \sigma^{-2} I \end{bmatrix}[I\ 0]^T$
$= [U\ U_{F-d}]\begin{bmatrix} (L^2 + \sigma^2 I)^{-1} \\ 0 \end{bmatrix} = U (L^2 + \sigma^2 I)^{-1}$

ML-PPCA
$S_t D^{-1} U L V^T = U L V^T \Rightarrow S_t U (L^2 + \sigma^2 I)^{-1} = U \Rightarrow S_t U = U (L^2 + \sigma^2 I)$
It means that $S_t u_i = (l_i^2 + \sigma^2) u_i$: since $S_t = U \Lambda U^T$, the $u_i$ are eigenvectors of $S_t$ with $\lambda_i = l_i^2 + \sigma^2$, i.e. $l_i = \sqrt{\lambda_i - \sigma^2}$.
Unfortunately $V$ cannot be determined, hence there is a rotation ambiguity.

ML-PPCA
Hence the optimum is given by (keeping $d$ eigenvectors): $W_d = U_d (\Lambda_d - \sigma^2 I)^{1/2} V^T$
Having computed $W$, we need to compute the optimum $\sigma^2$:
$L(W, \sigma^2, \mu) = -\frac{NF}{2}\ln 2\pi - \frac{N}{2}\ln|D| - \frac{N}{2}\,\mathrm{tr}[D^{-1} S_t]$
$= -\frac{NF}{2}\ln 2\pi - \frac{N}{2}\ln|W W^T + \sigma^2 I| - \frac{N}{2}\,\mathrm{tr}[(W W^T + \sigma^2 I)^{-1} S_t]$

ML-PPCA
$W_d W_d^T + \sigma^2 I = [U_d\ U_{F-d}]\begin{bmatrix} \Lambda_d - \sigma^2 I & 0 \\ 0 & 0 \end{bmatrix}[U_d\ U_{F-d}]^T + [U_d\ U_{F-d}]\begin{bmatrix} \sigma^2 I & 0 \\ 0 & \sigma^2 I \end{bmatrix}[U_d\ U_{F-d}]^T = [U_d\ U_{F-d}]\begin{bmatrix} \Lambda_d & 0 \\ 0 & \sigma^2 I \end{bmatrix}[U_d\ U_{F-d}]^T$
Hence
$\ln|W_d W_d^T + \sigma^2 I| = (F - d)\ln\sigma^2 + \sum_{i=1}^{d}\ln\lambda_i$

ML-PPCA
$D^{-1} S_t = [U_d\ U_{F-d}]\begin{bmatrix} \Lambda_d^{-1} & 0 \\ 0 & \sigma^{-2} I \end{bmatrix}[U_d\ U_{F-d}]^T\, [U_d\ U_{F-d}]\begin{bmatrix} \Lambda_d & 0 & 0 \\ 0 & \Lambda_{q-d} & 0 \\ 0 & 0 & 0 \end{bmatrix}[U_d\ U_{F-d}]^T = [U_d\ U_{F-d}]\begin{bmatrix} I & 0 & 0 \\ 0 & \frac{1}{\sigma^2}\Lambda_{q-d} & 0 \\ 0 & 0 & 0 \end{bmatrix}[U_d\ U_{F-d}]^T$
Hence
$\mathrm{tr}[D^{-1} S_t] = d + \frac{1}{\sigma^2}\sum_{i=d+1}^{q}\lambda_i$

ML-PPCA
$L(\sigma^2) = -\frac{N}{2}\left\{ F\ln 2\pi + \sum_{j=1}^{d}\ln\lambda_j + (F - d)\ln\sigma^2 + \frac{1}{\sigma^2}\sum_{j=d+1}^{q}\lambda_j + d \right\}$
$\frac{\partial L}{\partial \sigma} = 0 \Rightarrow \frac{N}{\sigma^3}\sum_{j=d+1}^{q}\lambda_j - \frac{N(F - d)}{\sigma} = 0 \Rightarrow \sigma^2 = \frac{1}{F - d}\sum_{j=d+1}^{q}\lambda_j$
If I put the solution back:
$L(\sigma^2) = -\frac{N}{2}\left\{ \sum_{j=1}^{d}\ln\lambda_j + (F - d)\ln\left(\frac{1}{F - d}\sum_{j=d+1}^{q}\lambda_j\right) + F\ln 2\pi + F \right\}$

ML-PPCA
Using $\sum_{j=1}^{d}\ln\lambda_j = \sum_{j=1}^{q}\ln\lambda_j - \sum_{j=d+1}^{q}\ln\lambda_j = \ln|S_t| - \sum_{j=d+1}^{q}\ln\lambda_j$ (over the nonzero eigenvalues of $S_t$):
$L(\sigma^2) = -\frac{N}{2}\left\{ \ln|S_t| - \sum_{j=d+1}^{q}\ln\lambda_j + (F - d)\ln\left(\frac{1}{F - d}\sum_{j=d+1}^{q}\lambda_j\right) + F\ln 2\pi + F \right\}$
Maximizing $L$ over the choice of discarded eigenvalues is therefore equivalent to minimizing
$\ln\left(\frac{1}{F - d}\sum_{j=d+1}^{q}\lambda_j\right) - \frac{1}{F - d}\sum_{j=d+1}^{q}\ln\lambda_j$

ML-PPCA
Jensen's inequality: $\ln\left(\frac{1}{n}\sum_{i=1}^{n} r_i\right) \ge \frac{1}{n}\sum_{i=1}^{n}\ln r_i$, so
$\ln\left(\frac{1}{F - d}\sum_{j=d+1}^{q}\lambda_j\right) \ge \frac{1}{F - d}\sum_{j=d+1}^{q}\ln\lambda_j$, hence
$\ln\left(\frac{1}{F - d}\sum_{j=d+1}^{q}\lambda_j\right) - \frac{1}{F - d}\sum_{j=d+1}^{q}\ln\lambda_j \ge 0$
Hence, the function is minimized when the discarded eigenvectors are the ones that correspond to the $q - d$ smallest eigenvalues.

ML-PPCA
To summarize:
$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$, $\sigma^2 = \frac{1}{F - d}\sum_{j=d+1}^{q}\lambda_j$, $W_d = U_d(\Lambda_d - \sigma^2 I)^{1/2} V^T$
We no longer have a projection but a posterior expectation
$E_{p(y_i \mid x_i)}\{y_i\} = M^{-1} W^T (x_i - \mu)$
and a reconstruction
$\tilde{x}_i = W\, E_{p(y_i \mid x_i)}\{y_i\} + \mu$
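
Putting the closed-form ML solution together, here is a minimal NumPy sketch (not from the slides, taking the arbitrary rotation $V = I$); variable names and the toy data are illustrative only.

```python
import numpy as np

def ppca_ml_fit(X, d):
    """Closed-form ML PPCA: returns mu, W_d (with V = I), sigma^2, latent expectations, reconstructions."""
    N, F = X.shape
    mu = X.mean(axis=0)
    S_t = np.cov(X.T, bias=True)                       # S_t = (1/N) sum_i (x_i-mu)(x_i-mu)^T
    lam, U = np.linalg.eigh(S_t)                       # eigh returns ascending eigenvalues
    lam, U = lam[::-1], U[:, ::-1]                     # sort descending
    sigma2 = lam[d:].mean()                            # sigma^2 = average of the discarded eigenvalues
    W = U[:, :d] @ np.diag(np.sqrt(lam[:d] - sigma2))  # W_d = U_d (Lambda_d - sigma^2 I)^{1/2}
    M = sigma2 * np.eye(d) + W.T @ W
    E_y = (X - mu) @ W @ np.linalg.inv(M)              # E{y_i} = M^{-1} W^T (x_i - mu)
    X_rec = E_y @ W.T + mu                             # reconstruction W E{y_i} + mu
    return mu, W, sigma2, E_y, X_rec

# Hypothetical usage on correlated toy data
rng = np.random.default_rng(3)
X = rng.standard_normal((200, 10)) @ rng.standard_normal((10, 10)) * 0.5
mu, W, sigma2, E_y, X_rec = ppca_ml_fit(X, d=2)
print(W.shape, round(sigma2, 3))
```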

ML-PPCA
$\lim_{\sigma^2 \to 0} W_d = U_d \Lambda_d^{1/2}$ (taking $V = I$), and $\lim_{\sigma^2 \to 0} M = W_d^T W_d = \Lambda_d$
Hence
$\lim_{\sigma^2 \to 0} E_{p(y_i \mid x_i)}\{y_i\} = (W_d^T W_d)^{-1} W_d^T (x_i - \mu) = \Lambda_d^{-1/2} U_d^T (x_i - \mu)$
which gives whitened PCA.
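
A quick numeric illustration of this limit (a sketch with arbitrary data): for near-zero noise variance, the posterior mean reduces to the whitened-PCA projection $\Lambda_d^{-1/2} U_d^T (x_i - \mu)$.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((300, 6)) @ rng.standard_normal((6, 6))
mu = X.mean(axis=0)
lam, U = np.linalg.eigh(np.cov(X.T, bias=True))
lam, U = lam[::-1], U[:, ::-1]
d, sigma2 = 2, 1e-10                                   # near-zero noise variance

W = U[:, :d] @ np.diag(np.sqrt(lam[:d] - sigma2))      # W_d -> U_d Lambda_d^{1/2}
M = sigma2 * np.eye(d) + W.T @ W                       # M -> Lambda_d
E_y = np.linalg.solve(M, W.T @ (X - mu).T).T           # posterior means M^{-1} W^T (x_i - mu)
whitened = (X - mu) @ U[:, :d] / np.sqrt(lam[:d])      # Lambda_d^{-1/2} U_d^T (x_i - mu)
print(np.allclose(E_y, whitened, atol=1e-6))           # True in the small-noise limit
```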

EM-PPCA
First step: formulate the complete-data likelihood
$p(X, Y \mid \theta) = p(X \mid Y, \theta)\, p(Y) = \prod_{i=1}^{N} p(x_i \mid y_i, \theta)\, p(y_i)$
$\ln p(X, Y \mid \theta) = \sum_{i=1}^{N}\left[\ln p(x_i \mid y_i, \theta) + \ln p(y_i)\right]$
$\ln p(x_i \mid y_i, \theta) = \ln\left[\frac{1}{(2\pi)^{F/2}(\sigma^2)^{F/2}}\, e^{-\frac{1}{2\sigma^2}(x_i - W y_i - \mu)^T (x_i - W y_i - \mu)}\right] = -\frac{F}{2}\ln 2\pi\sigma^2 - \frac{1}{2\sigma^2}(x_i - W y_i - \mu)^T (x_i - W y_i - \mu)$
$\ln p(y_i) = -\frac{1}{2} y_i^T y_i - \frac{d}{2}\ln 2\pi$

EM-PPCA
$\ln p(X, Y \mid \theta) = \sum_{i=1}^{N}\left\{ -\frac{F}{2}\ln 2\pi\sigma^2 - \frac{1}{2\sigma^2}(x_i - W y_i - \mu)^T (x_i - W y_i - \mu) - \frac{1}{2} y_i^T y_i - \frac{d}{2}\ln 2\pi \right\}$
We now need to optimize this with respect to $y_i$, $W$, $\mu$, $\sigma^2$. We cannot do so directly, since the $y_i$ are unobserved, hence we take expectations with respect to the posterior $p(y_i \mid x_i)$.

EM-PPCA
But first we need to expand a bit:
$\ln p(X, Y \mid \theta) = -\frac{NF}{2}\ln 2\pi\sigma^2 - \frac{Nd}{2}\ln 2\pi - \sum_{i=1}^{N}\left\{ \frac{1}{2\sigma^2}\left[(x_i - \mu)^T (x_i - \mu) - 2\,(W^T (x_i - \mu))^T y_i + y_i^T W^T W y_i\right] + \frac{1}{2} y_i^T y_i \right\}$

EM-PPCA
$E_{p(Y \mid X)}\{\ln p(X, Y \mid \theta)\} = -\frac{NF}{2}\ln 2\pi\sigma^2 - \frac{Nd}{2}\ln 2\pi - \sum_{i=1}^{N}\left\{ \frac{1}{2\sigma^2}\left[(x_i - \mu)^T (x_i - \mu) - 2\,(W^T (x_i - \mu))^T E\{y_i\} + \mathrm{tr}\left(E\{y_i y_i^T\} W^T W\right)\right] + \frac{1}{2}\mathrm{tr}\left(E\{y_i y_i^T\}\right) \right\}$

EM-PPCA
$E\{y_i\} = \int y_i\, p(y_i \mid x_i, \theta)\, dy_i$, the mean of $\mathcal{N}\left(y_i \mid M^{-1} W^T (x_i - \mu),\ \sigma^2 M^{-1}\right)$, so $E\{y_i\} = M^{-1} W^T (x_i - \mu)$
The posterior covariance is
$E\left\{(y_i - E\{y_i\})(y_i - E\{y_i\})^T\right\} = E\{y_i y_i^T\} - E\{y_i\} E\{y_i\}^T - E\{y_i\} E\{y_i\}^T + E\{y_i\} E\{y_i\}^T = E\{y_i y_i^T\} - E\{y_i\} E\{y_i\}^T = \sigma^2 M^{-1}$
hence $E\{y_i y_i^T\} = \sigma^2 M^{-1} + E\{y_i\} E\{y_i\}^T$

EM-PPCA
Expectation step: given $W$, $\mu$, $\sigma^2$ (and $M = \sigma^2 I + W^T W$):
$E\{y_i\} = M^{-1} W^T (x_i - \mu)$
$E\{y_i y_i^T\} = \sigma^2 M^{-1} + E\{y_i\} E\{y_i\}^T$
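
A minimal sketch of the E-step in NumPy (rows of `X` are the $x_i$; names are illustrative, not from the slides):

```python
import numpy as np

def e_step(X, W, mu, sigma2):
    """E{y_i} = M^{-1} W^T (x_i - mu),  E{y_i y_i^T} = sigma^2 M^{-1} + E{y_i} E{y_i}^T."""
    d = W.shape[1]
    M = sigma2 * np.eye(d) + W.T @ W
    M_inv = np.linalg.inv(M)
    E_y = (X - mu) @ W @ M_inv                          # N x d matrix of posterior means
    E_yy = sigma2 * M_inv + E_y[:, :, None] * E_y[:, None, :]   # N x d x d second moments
    return E_y, E_yy
```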

EM-PPCA
Maximization step:
$\frac{\partial E\{\ln p(X, Y \mid \theta)\}}{\partial \mu} = 0 \Rightarrow \mu = \frac{1}{N}\sum_{i=1}^{N} x_i$
$\frac{\partial E\{\ln p(X, Y \mid \theta)\}}{\partial W} = 0 \Rightarrow \sum_{i=1}^{N}\left[\frac{1}{\sigma^2}(x_i - \mu) E\{y_i\}^T - \frac{1}{\sigma^2} W E\{y_i y_i^T\}\right] = 0$
$\Rightarrow W = \left[\sum_{i=1}^{N}(x_i - \mu) E\{y_i\}^T\right]\left[\sum_{i=1}^{N} E\{y_i y_i^T\}\right]^{-1}$

EM-PPCA
$\frac{\partial E\{\ln p(X, Y \mid \theta)\}}{\partial \sigma^2} = 0 \Rightarrow \sigma^2 = \frac{1}{NF}\sum_{i=1}^{N}\left\{ \|x_i - \mu\|^2 - 2\, E\{y_i\}^T W^T (x_i - \mu) + \mathrm{tr}\left(E\{y_i y_i^T\} W^T W\right) \right\}$
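
And a matching sketch of the M-step, plus a small EM loop that alternates the two steps (it reuses `e_step` from the sketch above; initialization and iteration count are arbitrary choices):

```python
import numpy as np

def m_step(X, E_y, E_yy, mu):
    """W = [sum_i (x_i-mu) E{y_i}^T][sum_i E{y_i y_i^T}]^{-1}, then update sigma^2 with the new W."""
    N, F = X.shape
    Xc = X - mu
    W = (Xc.T @ E_y) @ np.linalg.inv(E_yy.sum(axis=0))
    sigma2 = (np.sum(Xc ** 2)
              - 2 * np.sum(E_y * (Xc @ W))              # sum_i 2 E{y_i}^T W^T (x_i - mu)
              + np.einsum('nij,ij->', E_yy, W.T @ W)    # sum_i tr(E{y_i y_i^T} W^T W)
              ) / (N * F)
    return W, sigma2

def em_ppca(X, d, n_iter=100):
    N, F = X.shape
    mu = X.mean(axis=0)
    rng = np.random.default_rng(0)
    W, sigma2 = rng.standard_normal((F, d)), 1.0        # arbitrary initialization
    for _ in range(n_iter):
        E_y, E_yy = e_step(X, W, mu, sigma2)            # e_step defined in the previous sketch
        W, sigma2 = m_step(X, E_y, E_yy, mu)
    return W, mu, sigma2

# Hypothetical usage: W_hat, mu_hat, s2_hat = em_ppca(X, d=2)
```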

Why EM-PPCA?
A complexity of $O(NFd)$, which can be significantly smaller than the $O(NF^2)$ cost of computing the covariance matrix.
A combination of a probabilistic model and EM allows us to deal with missing values in the dataset.
(Figure: data matrices with missing values.)

Why EM-PPCA?
A combination of a probabilistic model and EM allows us to deal with missing values in the dataset.
(Figure: component models $p(x_i \mid y_i, W_1, \mu_1, \sigma)$ and $p(x_i \mid y_i, W_9, \mu_9, \sigma)$.)

Summary
We saw how to perform parameter estimation and inference using ML and EM in the following graphical model.
(Figure: PPCA plate diagram with latent $y_i$, observed $x_i$, and parameters $W$, $\mu$, $\sigma^2$.)

What will we see next?
EM in the following graphical models:
Dynamic data: a chain of latent variables $y_1, y_2, y_3, \dots, y_T$ generating observations $x_1, x_2, x_3, \dots, x_T$.
Spatial data: a grid of latent variables $y_{uv}$ generating observations $x_{uv}$.