Probabilistic Unsupervised Learning


Statistical Data Mining and Machine Learning, Hilary Term 2016
Dino Sejdinovic, Department of Statistics, Oxford
Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml

Probabilistic Methods

Algorithmic approach: Data -> Algorithm -> Analysis/Interpretation.
Probabilistic modelling approach: Unobserved process -> Generative Model -> Data -> Analysis/Interpretation.

Mixture Models

Mixture models suppose that our dataset X was created by sampling iid from K distinct populations (called mixture components). Samples in population k can be modelled using a distribution F_{μ_k} with density f(x|μ_k), where μ_k is the model parameter for the k-th component. For a concrete example, consider a Gaussian with unknown mean μ_k and known diagonal covariance σ²I:

f(x|μ_k) = (2πσ²)^{-p/2} exp( -(1/(2σ²)) ||x - μ_k||²_2 ).

Generative model: for i = 1, 2, ..., n:
First determine the assignment variable independently for each data item i:

Z_i ~ Discrete(π_1, ..., π_K), i.e., P(Z_i = k) = π_k,

where the mixing proportions (additional model parameters) satisfy π_k ≥ 0 for each k and Σ_{k=1}^K π_k = 1.
Given the assignment Z_i = k, the data item X_i = (X_i^(1), ..., X_i^(p)) is sampled (independently) from the corresponding k-th component:

X_i | Z_i = k ~ f(x|μ_k).

We observe X_i = x_i for each i but not the Z_i (latent variables), and would like to infer the parameters.
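The generative story above is easy to simulate. The following sketch (not from the slides; all parameter values are made up for illustration) samples a synthetic dataset from a K-component spherical Gaussian mixture in Python/NumPy:

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical example: K = 3 spherical Gaussian components in p = 2 dimensions.
K, p, n = 3, 2, 500
pi = np.array([0.5, 0.3, 0.2])                          # mixing proportions, sum to 1
mu = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]])    # component means mu_k
sigma2 = 1.5                                            # known spherical covariance sigma^2 I

# Z_i ~ Discrete(pi), then X_i | Z_i = k ~ N(mu_k, sigma^2 I)
Z = rng.choice(K, size=n, p=pi)                         # latent assignments (hidden when fitting)
X = mu[Z] + np.sqrt(sigma2) * rng.standard_normal((n, p))

print(X.shape)   # (500, 2): the observed data; the Z_i would not be observed in practice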

Mixture Models: Joint pmf/pdf of observed and latent variables

Unknowns to learn given data are
- Parameters: θ = (π_k, μ_k)_{k=1}^K, where π_1, ..., π_K ∈ [0, 1] and μ_1, ..., μ_K ∈ R^p, and
- Latent variables: z_1, ..., z_n.

The joint probability over all cluster indicator variables {Z_i} is the probability mass function (1):

p_Z((z_i)_{i=1}^n) = Π_{i=1}^n π_{z_i} = Π_{i=1}^n Π_{k=1}^K π_k^{1(z_i=k)}.

The joint density of the observations X_i = x_i given Z_i = z_i is:

p_X((x_i)_{i=1}^n | (Z_i = z_i)_{i=1}^n) = Π_{i=1}^n f(x_i|μ_{z_i}) = Π_{i=1}^n Π_{k=1}^K f(x_i|μ_k)^{1(z_i=k)}.

Together, the joint pmf/pdf of observed and latent variables is:

p_{X,Z}((x_i, z_i)_{i=1}^n) = p_Z((z_i)_{i=1}^n) p_X((x_i)_{i=1}^n | (Z_i = z_i)_{i=1}^n) = Π_{i=1}^n Π_{k=1}^K (π_k f(x_i|μ_k))^{1(z_i=k)},

and the marginal density of x_i (the resulting model on the observed data) is:

p(x_i) = Σ_{j=1}^K p(z_i = j, x_i) = Σ_{j=1}^K π_j f(x_i|μ_j).

(1) In this course we will treat probability mass functions and densities in the same way for notational simplicity. Strictly speaking, p_{X,Z} is a density with respect to the product base measure, where the base measure is the counting measure for discrete variables and Lebesgue for continuous variables.

Table 11.1 of Murphy (2012), a summary of some popular directed latent variable models. Here Prod. means product, so Prod. Discrete in the likelihood means a factored distribution of the form Π_j Cat(x_ij|z_i), and Prod. Gaussian means a factored distribution of the form Π_j N(x_ij|z_i). PCA stands for principal components analysis; ICA stands for independent components analysis.

p(x_i|z_i)        p(z_i)            Name                                 Section
MVN               Discrete          Mixture of Gaussians                 11.2.1
Prod. Discrete    Discrete          Mixture of multinomials              11.2.2
Prod. Gaussian    Prod. Gaussian    Factor analysis / probabilistic PCA  12.1.5
Prod. Gaussian    Prod. Laplace     Probabilistic ICA / sparse coding    12.6
Prod. Discrete    Prod. Gaussian    Multinomial PCA                      27.2.3
Prod. Discrete    Dirichlet         Latent Dirichlet allocation          27.3
Prod. Noisy-OR    Prod. Bernoulli   BN20 / QMR                           10.2.3
Prod. Bernoulli   Prod. Bernoulli   Sigmoid belief net                   27.7

Mixture Models: Gaussian Mixtures with Unequal Covariances

The most widely used mixture model is the mixture of Gaussians (MOG), also called a Gaussian mixture model or GMM (Murphy, 2012, Section 11.2.1). Each base distribution in the mixture is a multivariate Gaussian with mean μ_k and covariance matrix Σ_k, so the model has the form

p(x) = Σ_{k=1}^K π_k f(x|(μ_k, Σ_k)),

where θ = (π_k, μ_k, Σ_k)_{k=1}^K are all the model parameters and

f(x|(μ_k, Σ_k)) = (2π)^{-p/2} |Σ_k|^{-1/2} exp( -(1/2) (x - μ_k)^T Σ_k^{-1} (x - μ_k) ).

[Figure 11.3 from Murphy, 2012, Ch. 11: a mixture of 3 Gaussians in 2d. (a) Contours of constant probability for each component in the mixture. (b) A surface plot of the overall density. Based on Figure 2.23 of Bishop (2006).]

Mixture Models: Responsibility

Suppose we know the parameters θ = (π_k, μ_k)_{k=1}^K. Z_i is a random variable, and its conditional distribution given the data set X is:

Q_ik := p(z_i = k | x_i) = p(z_i = k, x_i) / p(x_i) = π_k f(x_i|μ_k) / Σ_{j=1}^K π_j f(x_i|μ_j).

The conditional probability Q_ik is called the responsibility of mixture component k for data point x_i. These conditionals softly partition the dataset among the K components: Σ_{k=1}^K Q_ik = 1.
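Given the parameters, the responsibilities can be computed in a few lines. The helper below is an illustrative sketch for the spherical-covariance case (the function name and the use of scipy are my own choices, not from the slides):

import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, pi, mu, sigma2):
    # Q[i, k] = pi_k f(x_i|mu_k) / sum_j pi_j f(x_i|mu_j) for a spherical Gaussian mixture
    n, p = X.shape
    K = len(pi)
    dens = np.column_stack([
        multivariate_normal.pdf(X, mean=mu[k], cov=sigma2 * np.eye(p)) for k in range(K)
    ])                                   # n x K matrix of f(x_i|mu_k)
    weighted = dens * pi                 # scale column k by pi_k
    return weighted / weighted.sum(axis=1, keepdims=True)   # each row sums to 1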

Mixture Models: Maximum Likelihood

How can we learn about the parameters θ = (π_k, μ_k)_{k=1}^K from data? Standard statistical methodology asks for the maximum likelihood estimator (MLE). The goal is to maximise the marginal probability of the data over the parameters:

θ̂_ML = argmax_{θ=(π_k,μ_k)_{k=1}^K} p(X|θ)
      = argmax_{(π_k,μ_k)_{k=1}^K} Π_{i=1}^n p(x_i|(π_k,μ_k)_{k=1}^K)
      = argmax_{(π_k,μ_k)_{k=1}^K} Π_{i=1}^n Σ_{k=1}^K π_k f(x_i|μ_k)
      = argmax_{(π_k,μ_k)_{k=1}^K} Σ_{i=1}^n log Σ_{k=1}^K π_k f(x_i|μ_k),

where the last expression is the marginal log-likelihood

l((π_k, μ_k)_{k=1}^K) := log p(X|(π_k, μ_k)_{k=1}^K) = Σ_{i=1}^n log Σ_{k=1}^K π_k f(x_i|μ_k).

The gradient w.r.t. μ_k:

∇_{μ_k} l((π_k, μ_k)_{k=1}^K) = Σ_{i=1}^n [ π_k f(x_i|μ_k) / Σ_{j=1}^K π_j f(x_i|μ_j) ] ∇_{μ_k} log f(x_i|μ_k)
                              = Σ_{i=1}^n Q_ik ∇_{μ_k} log f(x_i|μ_k).

Difficult to solve, as Q_ik depends implicitly on μ_k.

Likelihood Surface for a Simple Example

If the latent variables z_i were all observed, we would have a unimodal likelihood surface; but when we marginalise out the latents, the likelihood surface becomes multimodal: there is no unique MLE.

[Figure 11.6 from Murphy, 2012: (left) n = 200 data points sampled from a mixture of two 1D Gaussians, with π_1 = π_2 = 0.5, σ = 5, μ_1 = -10 and μ_2 = 10. (right) Observed-data log-likelihood surface l(μ_1, μ_2), with all other parameters set to their true values. The two symmetric modes reflect the unidentifiability of the parameters.]

Note that mixture models are not identifiable: many settings of the parameters have the same likelihood. Specifically, in a mixture model with K components there are K! equivalent parameter settings, which differ merely by a permutation of the component labels.
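A surface like the one in the figure can be reproduced by evaluating the marginal log-likelihood l on a grid of (μ_1, μ_2) values. A hedged sketch of the evaluation itself for the spherical-covariance case (function name and use of scipy are my own choices, not from the slides):

import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def marginal_log_likelihood(X, pi, mu, sigma2):
    # l(theta) = sum_i log sum_k pi_k f(x_i|mu_k), computed with logsumexp for stability
    n, p = X.shape
    K = len(pi)
    log_terms = np.column_stack([
        np.log(pi[k]) + multivariate_normal.logpdf(X, mean=mu[k], cov=sigma2 * np.eye(p))
        for k in range(K)
    ])                                     # n x K matrix of log(pi_k f(x_i|mu_k))
    return logsumexp(log_terms, axis=1).sum()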

Mixture Models: Maximum Likelihood

Recall that we would like to solve

∇_{μ_k} l((π_k, μ_k)_{k=1}^K) = Σ_{i=1}^n Q_ik ∇_{μ_k} log f(x_i|μ_k) = 0.

What if we ignore the dependence of Q_ik on the parameters? Taking the mixture of Gaussians with covariance σ²I as an example,

Σ_{i=1}^n Q_ik ∇_{μ_k} ( -(p/2) log(2πσ²) - (1/(2σ²)) ||x_i - μ_k||²_2 )
  = (1/σ²) Σ_{i=1}^n Q_ik (x_i - μ_k)
  = (1/σ²) ( Σ_{i=1}^n Q_ik x_i - μ_k Σ_{i=1}^n Q_ik ) = 0

⟹ μ_k^ML ?= Σ_{i=1}^n Q_ik x_i / Σ_{i=1}^n Q_ik.

The estimate is a weighted average of data points, where the estimated mean of cluster k uses its responsibilities to data points as weights. Makes sense: suppose we knew that data point x_i came from population z_i. Then Q_{i z_i} = 1 and Q_ik = 0 for k ≠ z_i, and

μ_k^ML ?= Σ_{i: z_i = k} x_i / Σ_{i: z_i = k} 1 = avg{x_i : z_i = k}.

Our best guess of the originating population is given by Q_ik. A soft K-means algorithm?

Gradient w.r.t. the mixing proportion π_k (including a Lagrange multiplier term λ(Σ_{k=1}^K π_k - 1) to enforce the constraint Σ_{k=1}^K π_k = 1):

∂/∂π_k [ l((π_k, μ_k)_{k=1}^K) - λ( Σ_{k=1}^K π_k - 1 ) ]
  = Σ_{i=1}^n f(x_i|μ_k) / Σ_{j=1}^K π_j f(x_i|μ_j) - λ
  = Σ_{i=1}^n Q_ik / π_k - λ = 0,

so π_k = (1/λ) Σ_{i=1}^n Q_ik. Summing over k and using Σ_{k=1}^K Q_ik = 1 and Σ_{k=1}^K π_k = 1 gives λ = n, hence

π_k^ML ?= (1/n) Σ_{i=1}^n Q_ik.

Again makes sense: the estimate is simply (our best guess of) the proportion of data points coming from population k.

Mixture Models: The EM Algorithm

Putting all the derivations together, we get an iterative algorithm for learning about the unknowns in the mixture model. Start with some initial parameters (π_k^(0), μ_k^(0))_{k=1}^K and iterate for t = 1, 2, ...:

Expectation Step:
Q_ik^(t) := π_k^(t-1) f(x_i|μ_k^(t-1)) / Σ_{j=1}^K π_j^(t-1) f(x_i|μ_j^(t-1)).

Maximization Step:
π_k^(t) = (1/n) Σ_{i=1}^n Q_ik^(t),    μ_k^(t) = Σ_{i=1}^n Q_ik^(t) x_i / Σ_{i=1}^n Q_ik^(t).

Will the algorithm converge? What does it converge to?
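The two updates above fit in a few lines of code. The sketch below assumes, as in the slides' running example, a known spherical covariance σ²I and estimates only the mixing proportions and means; initialisation and names are my own choices, not a prescribed implementation:

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, sigma2=1.0, n_iter=100, seed=0):
    # EM for a Gaussian mixture with known covariance sigma^2 * I.
    rng = np.random.default_rng(seed)
    n, p = X.shape
    pi = np.full(K, 1.0 / K)                         # pi_k^(0): uniform
    mu = X[rng.choice(n, size=K, replace=False)]     # mu_k^(0): K random data points
    for _ in range(n_iter):
        # E-step: responsibilities Q[i, k]
        dens = np.column_stack([
            multivariate_normal.pdf(X, mean=mu[k], cov=sigma2 * np.eye(p)) for k in range(K)
        ])
        Q = dens * pi
        Q /= Q.sum(axis=1, keepdims=True)
        # M-step: update mixing proportions and means
        Nk = Q.sum(axis=0)                           # sum_i Q_ik
        pi = Nk / n                                  # pi_k = (1/n) sum_i Q_ik
        mu = (Q.T @ X) / Nk[:, None]                 # mu_k = sum_i Q_ik x_i / sum_i Q_ik
    return pi, mu, Q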

[Figures: an example with 3 clusters (axes X1, X2), showing the fit after the 1st, 2nd, 3rd and 4th E and M steps (iterations 1-4).]

[Figure: the fit after the 5th E and M step (iteration 5).]

The EM Algorithm

In a maximum likelihood framework, the objective function is the log likelihood,

l(θ) = Σ_{i=1}^n log Σ_{k=1}^K π_k f(x_i|μ_k).

Direct maximisation is not feasible. Consider another objective function F(θ, q), where q is any probability distribution on the latent variables z, such that:

F(θ, q) ≤ l(θ) for all θ, q,
max_q F(θ, q) = l(θ),

i.e., F(θ, q) is a lower bound on the log likelihood. We can construct an alternating maximisation algorithm as follows. For t = 1, 2, ... until convergence:

q^(t) := argmax_q F(θ^(t-1), q)
θ^(t) := argmax_θ F(θ, q^(t))

The EM Algorithm - Solving for q

The lower bound we use is called the variational free energy. q is a probability mass function for a distribution over z := (z_i)_{i=1}^n:

F(θ, q) = E_q[ log p(x, z|θ) - log q(z) ]
        = E_q[ ( Σ_{i=1}^n Σ_{k=1}^K 1(z_i = k)(log π_k + log f(x_i|μ_k)) ) - log q(z) ]
        = Σ_z q(z) [ ( Σ_{i=1}^n Σ_{k=1}^K 1(z_i = k)(log π_k + log f(x_i|μ_k)) ) - log q(z) ].

Lemma: F(θ, q) ≤ l(θ) for all q and for all θ.
Lemma: F(θ, q) = l(θ) for q(z) = p(z|x, θ).

In combination with the previous Lemma, this implies that q(z) = p(z|x, θ) maximizes F(θ, q) for fixed θ, i.e., the optimal q is simply the conditional distribution of the latents given the data and that fixed θ. In the mixture model,

q*(z) = p(z|x, θ) = p(z, x|θ) / p(x|θ)
      = Π_{i=1}^n π_{z_i} f(x_i|μ_{z_i}) / Σ_{z'} Π_{i=1}^n π_{z'_i} f(x_i|μ_{z'_i})
      = Π_{i=1}^n [ π_{z_i} f(x_i|μ_{z_i}) / Σ_{k=1}^K π_k f(x_i|μ_k) ]
      = Π_{i=1}^n p(z_i|x_i, θ).
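The first lemma is stated without proof; the standard one-line argument, added here as a sketch rather than taken from the slides, is Jensen's inequality applied to the concave logarithm:

\begin{align*}
l(\theta) &= \log p(x \mid \theta) = \log \sum_{z} q(z)\,\frac{p(x, z \mid \theta)}{q(z)} \\
          &\ge \sum_{z} q(z)\,\log \frac{p(x, z \mid \theta)}{q(z)} \qquad \text{(Jensen's inequality, $\log$ concave)} \\
          &= \mathbb{E}_q\!\left[\log p(x, z \mid \theta) - \log q(z)\right] = F(\theta, q),
\end{align*}

with equality if and only if p(x, z|θ)/q(z) is constant in z, i.e. q(z) = p(z|x, θ), which also gives the second lemma.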

The EM Algorithm - Solving for θ

Setting the derivative with respect to μ_k to 0,

∇_{μ_k} F(θ, q) = Σ_z q(z) Σ_{i=1}^n 1(z_i = k) ∇_{μ_k} log f(x_i|μ_k)
                = Σ_{i=1}^n q(z_i = k) ∇_{μ_k} log f(x_i|μ_k) = 0.

This equation can often be solved quite easily. E.g., for a mixture of Gaussians,

μ_k = Σ_{i=1}^n q(z_i = k) x_i / Σ_{i=1}^n q(z_i = k).

If it cannot be solved exactly, we can use a gradient ascent algorithm:

μ_k ← μ_k + α Σ_{i=1}^n q(z_i = k) ∇_{μ_k} log f(x_i|μ_k).

A similar derivation gives the optimal π_k as before.

The EM Algorithm

Start with some initial parameters (π_k^(0), μ_k^(0))_{k=1}^K and iterate for t = 1, 2, ...:

Expectation Step:
q^(t)(z_i = k) := p(z_i = k|x_i, θ^(t-1)) = π_k^(t-1) f(x_i|μ_k^(t-1)) / Σ_{j=1}^K π_j^(t-1) f(x_i|μ_j^(t-1)).

Maximization Step:
π_k^(t) = (1/n) Σ_{i=1}^n q^(t)(z_i = k),    μ_k^(t) = Σ_{i=1}^n q^(t)(z_i = k) x_i / Σ_{i=1}^n q^(t)(z_i = k).

Theorem: The EM algorithm monotonically increases the log likelihood.
Proof: l(θ^(t-1)) = F(θ^(t-1), q^(t)) ≤ F(θ^(t), q^(t)) ≤ F(θ^(t), q^(t+1)) = l(θ^(t)).

An additional assumption, that the Hessians ∇²_θ F(θ^(t), q^(t)) are negative definite with eigenvalues less than -ε < 0, implies that θ^(t) → θ*, where θ* is a local MLE.

Notes on the Probabilistic Approach and the EM Algorithm

Some good things:
- Guaranteed convergence to locally optimal parameters.
- Formal reasoning about uncertainties, using both Bayes Theorem and maximum likelihood theory.
- Rich language of probability theory to express a wide range of generative models, and straightforward derivation of algorithms for ML estimation.

Some bad things:
- Can get stuck in local optima, so multiple starts are recommended.
- Slower and more expensive than K-means.
- Choice of K still problematic, but a rich array of methods for model selection comes to the rescue.

Flexible Gaussian Mixture Models

We can allow each cluster to have its own mean and covariance structure to enable greater flexibility in the model (see the usage sketch after this list):
- Different covariances
- Identical covariances
- Different, but diagonal covariances
- Identical and spherical covariances
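These covariance structures correspond to familiar software options. As a hedged usage sketch (the data below are a random placeholder), scikit-learn's GaussianMixture exposes closely related choices through its covariance_type argument ('full', 'tied', 'diag', 'spherical'); note that its 'spherical' option gives each component its own single variance rather than one value shared across components, and BIC is one of the model-selection tools alluded to above:

import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).standard_normal((500, 2))   # placeholder data; use your own dataset

# Fit mixtures with different covariance structures and compare them by BIC.
# n_init > 1 addresses the multiple-restarts recommendation above.
for cov_type in ["full", "tied", "diag", "spherical"]:
    gmm = GaussianMixture(n_components=3, covariance_type=cov_type,
                          n_init=5, random_state=0).fit(X)
    print(cov_type, gmm.bic(X))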

Probabilistic PCA

A probabilistic model related to PCA has the following generative model: for i = 1, 2, ..., n:
Let k < n, p be given. Let Y_i be a (latent) k-dimensional normally distributed random variable with zero mean and identity covariance:

Y_i ~ N(0, I_k).

We model the distribution of the i-th data point given Y_i as a p-dimensional normal:

X_i ~ N(μ + L Y_i, σ²I),

where the parameters are a vector μ ∈ R^p, a matrix L ∈ R^{p×k} and σ² > 0.

[Figures from M. Sahani's UCL course on Unsupervised Learning: PPCA latents, PPCA latent prior, PPCA noise, PCA projection, and the principal subspace.]
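A minimal simulation of this generative model (all dimensions and parameter values below are invented for illustration) also makes the implied marginal covariance L L^T + σ²I easy to check empirically:

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: n data points in p dimensions, k latent dimensions (k < p).
n, p, k = 2000, 5, 2
mu = np.zeros(p)
L = rng.standard_normal((p, k))      # loading matrix L in R^{p x k}
sigma2 = 0.1

# PPCA generative model: Y_i ~ N(0, I_k), X_i | Y_i = y_i ~ N(mu + L y_i, sigma^2 I_p)
Y = rng.standard_normal((n, k))
X = mu + Y @ L.T + np.sqrt(sigma2) * rng.standard_normal((n, p))

# Marginally X_i ~ N(mu, L L^T + sigma^2 I_p); the empirical covariance should be close.
print(np.abs(np.cov(X, rowvar=False) - (L @ L.T + sigma2 * np.eye(p))).max())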

Mixture of Probabilistic PCAs

[Figures from M. Sahani's UCL course on Unsupervised Learning: PPCA latents, PPCA latent prior, PPCA posterior, PPCA noise, PPCA projection, and the principal subspace.]

We have learnt two types of unsupervised learning techniques:
- Dimensionality reduction, e.g. PCA, MDS, Isomap.
- Clustering, e.g. K-means, linkage and mixture models.

Probabilistic models allow us to construct more complex models from simpler pieces. A mixture of probabilistic PCAs allows both clustering and dimensionality reduction at the same time:

Z_i ~ Discrete(π_1, ..., π_K)
Y_i ~ N(0, I_d)
X_i | Z_i = k, Y_i = y_i ~ N(μ_k + L_k y_i, σ²I_p)

This allows flexible modelling of covariance structure without using too many parameters (Ghahramani and Hinton, 1996).

Further Reading

- Hastie et al, Chapter 14.
- James et al, Chapter 10.
- Ripley, Chapter 9.
- Tukey, John W. (1980). We need both exploratory and confirmatory. The American Statistician 34 (1): 23-25.