Cheng Soon Ong & Christian Walder. Canberra, February–June 2018


Cheng Soon Ong & Christian Walder
Research Group and College of Engineering and Computer Science
Canberra, February–June 2018

Outline: Overview, Introduction, Linear Algebra, Probability, Linear Regression 1, Linear Regression 2, Linear Classification 1, Linear Classification 2, Kernel Methods, Sparse Kernel Methods, Mixture Models and EM 1, Mixture Models and EM 2, Neural Networks 1, Neural Networks 2, Principal Component Analysis, Autoencoders, Graphical Models 1, Graphical Models 2, Graphical Models 3, Sampling, Sequential Data 1, Sequential Data 2.

(Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning")

Part XI: Mixture Models and EM 1

Marginalisation

Sum rule: p(A, B) = Σ_C p(A, B, C)
Product rule: p(A, B) = p(A | B) p(B)

Why do we optimise the log likelihood?
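
One practical reason, shown in the short numpy sketch below (illustrative numbers only): for i.i.d. data the likelihood is a product of many small densities and underflows numerically, whereas the log likelihood is a well-behaved sum, and since the logarithm is monotonic both are maximised by the same parameters.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(2000)                           # 2000 i.i.d. standard-normal samples

    densities = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)    # N(x_n | 0, 1) for each point
    print(np.prod(densities))                               # likelihood: underflows to 0.0
    print(np.log(densities).sum())                          # log likelihood: a finite, stable sum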

Strategy in this course

Estimate best predictor = training = learning. Given data (x_1, y_1), ..., (x_n, y_n), find a predictor f_w(·).
1. Identify the type of input x and output y data
2. Propose a (linear) mathematical model for f_w
3. Design an objective function or likelihood
4. Calculate the optimal parameter (w)
5. Model uncertainty using the Bayesian approach
6. Implement and compute (the algorithm in python)
7. Interpret and diagnose results

We will study unsupervised learning this week.

Mixture Models and EM

Complex marginal distributions over observed variables can be expressed via more tractable joint distributions over the expanded space of observed and latent variables.
Mixture models can also be used to cluster data.
General technique for finding maximum likelihood estimators in latent variable models: the expectation-maximisation (EM) algorithm.

Example - Wallaby Distribution

Introduced very recently to show...

[Figure: density of the Wallaby distribution, plotted for x from -20 to 20.]

Example - Wallaby Distribution

... that already a mixture of three Gaussians can be fun:

    p(x) = (3/10) N(x | 5, 0.5) + (3/5) N(x | 9, 2) + (1/10) N(x | 20, 2)

[Figure: the resulting density, plotted for x from -20 to 20.]

Example - Wallaby Distribution

Use µ, σ as latent variables and define a distribution

    p(µ, σ) = 3/10  if (µ, σ) = (5, 0.5)
              3/5   if (µ, σ) = (9, 2)
              1/10  if (µ, σ) = (20, 2)
              0     otherwise.

Then

    p(x) = ∫∫ p(x, µ, σ) dµ dσ
         = ∫∫ p(x | µ, σ) p(µ, σ) dµ dσ
         = (3/10) N(x | 5, 0.5) + (3/5) N(x | 9, 2) + (1/10) N(x | 20, 2)

[Figure: the resulting density, plotted for x from -20 to 20.]
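
A minimal numpy sketch of this construction, assuming the component weights and parameters as reconstructed above and treating the second argument of each N(· | µ, σ) as a standard deviation: the marginal density is obtained by summing the joint p(x, µ, σ) over the three latent states.

    import numpy as np

    # Latent states (mu, sigma) and their probabilities p(mu, sigma); the exact
    # numbers follow the reconstruction above and are illustrative.
    states = [(5.0, 0.5), (9.0, 2.0), (20.0, 2.0)]
    probs = [3 / 10, 3 / 5, 1 / 10]

    def gauss(x, mu, sigma):
        # Univariate Gaussian density with mean mu and standard deviation sigma.
        return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

    def wallaby_density(x):
        # p(x) = sum over latent (mu, sigma) of p(x | mu, sigma) p(mu, sigma)
        return sum(p * gauss(x, mu, sigma) for (mu, sigma), p in zip(states, probs))

    x = np.linspace(-20, 40, 601)
    px = wallaby_density(x)
    print((px * (x[1] - x[0])).sum())    # Riemann sum of the density, numerically close to 1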

K-means Clustering

Given a set of data {x_1, ..., x_N} where x_n ∈ R^D, n = 1, ..., N.
Goal: Partition the data into K clusters.

K-means Clustering

Given a set of data {x_1, ..., x_N} where x_n ∈ R^D, n = 1, ..., N.
Goal: Partition the data into K clusters.
Each cluster contains points close to each other.
Introduce a prototype µ_k ∈ R^D for each cluster.
Goal: Find
1. a set of prototypes µ_k, k = 1, ..., K, each representing a different cluster,
2. an assignment of each data point to exactly one cluster.

K-means - The Algorithm

Start with arbitrarily chosen prototypes µ_k, k = 1, ..., K.
1. Assign each data point to the closest prototype.
2. Calculate new prototypes as the mean of all data points assigned to each of them.
In the following, we formalise this by introducing notation which will be useful later.

K-means - Notation

Binary indicator variables using the 1-of-K coding scheme:

    r_nk = 1 if data point x_n belongs to cluster k, and 0 otherwise.

Define a distortion measure

    J = Σ_{n=1}^N Σ_{k=1}^K r_nk ||x_n - µ_k||²

Find the values for {r_nk} and {µ_k} so as to minimise J.

K-means - Notation

Find the values for {r_nk} and {µ_k} so as to minimise J.
But {r_nk} depends on {µ_k}, and {µ_k} depends on {r_nk}.

K-means - Notation

Find the values for {r_nk} and {µ_k} so as to minimise J.
But {r_nk} depends on {µ_k}, and {µ_k} depends on {r_nk}.
Iterate until no further change:
1. Minimise J w.r.t. r_nk while keeping {µ_k} fixed:

       r_nk = 1 if k = argmin_j ||x_n - µ_j||², and 0 otherwise,   for n = 1, ..., N.

   (Expectation step)
2. Minimise J w.r.t. {µ_k} while keeping r_nk fixed; setting the derivative to zero,

       0 = 2 Σ_{n=1}^N r_nk (x_n - µ_k)   gives   µ_k = Σ_{n=1}^N r_nk x_n / Σ_{n=1}^N r_nk.

   (Maximisation step)
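
A minimal numpy sketch of these two alternating steps; the function name, the random initialisation from K data points, and the convergence test are illustrative choices, not from the slides.

    import numpy as np

    def kmeans(X, K, n_iter=100, rng=None):
        # Plain K-means on data X of shape (N, D); returns prototypes, assignments and J.
        rng = np.random.default_rng(rng)
        mu = X[rng.choice(len(X), size=K, replace=False)]   # arbitrarily chosen prototypes
        for _ in range(n_iter):
            # Step 1 (E-like): assign each point to its closest prototype.
            d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (N, K) squared distances
            r = d2.argmin(axis=1)
            # Step 2 (M-like): each prototype becomes the mean of the points assigned to it.
            new_mu = np.array([X[r == k].mean(axis=0) if np.any(r == k) else mu[k]
                               for k in range(K)])
            if np.allclose(new_mu, mu):                      # no further change
                break
            mu = new_mu
        # Final assignments and distortion measure J for the returned prototypes.
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        r = d2.argmin(axis=1)
        J = d2[np.arange(len(X)), r].sum()
        return mu, r, J

For instance, mu, r, J = kmeans(np.random.randn(500, 2), K=3) returns three prototypes, the hard assignments and the final distortion.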

K-means - Example

[Figure: K-means iterations on two-dimensional data, panels (a)-(f).]

K-means - Example

[Figure: K-means iterations continued, panels (d)-(i).]

K-means - Cost Function

[Figure] Cost function J after each E step (blue points) and M step (red points).

K-means - Notes

The initial condition is crucial for convergence. What happens if at least one cluster centre is too far from all data points?
Complex step: finding the nearest neighbour. (Use the triangle inequality; build k-d trees, ...)
Generalise to non-Euclidean dissimilarity measures V(x_n, µ_k) (the K-medoids algorithm):

    J = Σ_{n=1}^N Σ_{k=1}^K r_nk V(x_n, µ_k)

Online stochastic algorithm:
1. Draw a data point x_n and locate the nearest prototype µ_k.
2. Update only µ_k using a decreasing learning rate η_n:

       µ_k^new = µ_k^old + η_n (x_n - µ_k^old)
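
A short numpy sketch of the online update; the initialisation and the particular learning-rate schedule η_n = 1/n_k (where n_k counts the points absorbed by prototype k) are illustrative assumptions, not prescribed by the slide.

    import numpy as np

    def online_kmeans(stream, K, rng=None):
        # Sequential K-means: update only the nearest prototype for each incoming point.
        rng = np.random.default_rng(rng)
        mu = None
        counts = np.zeros(K)
        for x in stream:
            if mu is None:
                # Initialise all prototypes near the first observed point (one simple choice).
                mu = np.tile(x, (K, 1)) + 0.01 * rng.standard_normal((K, len(x)))
            k = ((mu - x) ** 2).sum(axis=1).argmin()      # nearest prototype
            counts[k] += 1
            eta = 1.0 / counts[k]                         # decreasing learning rate
            mu[k] = mu[k] + eta * (x - mu[k])             # mu_new = mu_old + eta (x - mu_old)
        return mu

Iterating over a (N, D) array row by row, e.g. online_kmeans(np.random.randn(1000, 2), K=3), gives the streaming counterpart of the batch algorithm above.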

K-means - Image Segmentation

Segment an image into regions of reasonably homogeneous appearance.
Each pixel is a point in R^3 (red, blue, green). (Note that the pixel intensities are bounded in the range [0, 1], and therefore this space is strictly speaking not Euclidean.)
Run K-means on all points of the image until convergence.
Replace all pixels with the corresponding mean µ_k.
This results in an image with a palette of only K different colours.
There are much better approaches to image segmentation (it is an active research topic); this serves only to illustrate K-means.
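
A small sketch of this procedure, using SciPy's kmeans2 for the clustering step; the image here is a random stand-in, whereas in practice img would be an H x W x 3 array of RGB values in [0, 1].

    import numpy as np
    from scipy.cluster.vq import kmeans2

    H, W = 120, 160
    img = np.random.rand(H, W, 3)                      # stand-in for a real RGB image in [0, 1]

    pixels = img.reshape(-1, 3)                        # every pixel becomes a point in R^3
    K = 3
    mu, labels = kmeans2(pixels, K, minit='points')    # cluster the pixel colours
    segmented = mu[labels].reshape(H, W, 3)            # replace each pixel by its cluster mean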

Illustrating K-means - Segmentation

[Figure: segmentation results for K = 2, K = 3 and K = 10, together with the original image.]

Illustrating K-means - Segmentation

[Figure: further segmentation examples.]

K-means - Compression

Lossy data compression: accept some errors in the reconstruction as a trade-off for higher compression.
Apply K-means to the data.
Store the code-book vectors µ_k.
Store the data in the form of references (labels) to the code-book: each data point has a label in the range {1, ..., K}.
New data points are also compressed by finding the closest code-book vector and then storing only the label.
This technique is also called vector quantisation.
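
A minimal sketch of the encode/decode step; the function names are illustrative, the codebook is the K x D array of prototypes found by K-means (e.g. mu from the sketch above), and K ≤ 256 is assumed so each label fits in one byte.

    import numpy as np

    def compress(X, codebook):
        # Vector quantisation: store only the index of the nearest code-book vector.
        d2 = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        return d2.argmin(axis=1).astype(np.uint8)     # labels in {0, ..., K-1}, K <= 256 assumed

    def decompress(labels, codebook):
        # Reconstruct each point as its code-book vector (lossy).
        return codebook[labels]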

Illustrating K-means - Compression

[Figure: compressed images for K = 2 (4.2%), K = 3 (8.3%) and K = 10 (16.7%), together with the original image (100%).]

Latent variable modeling

We have already seen a mixture of two Gaussians for linear classification.
However, in the clustering scenario we do not observe the class membership.
Strategy (this is vague):
We have a difficult distribution p(x).
We introduce a new variable z to get p(x, z).
Model p(x, z) = p(x | z) p(z) with easy distributions.

A Gaussian mixture distribution is a linear superposition of Gaussians of the form

    p(x) = Σ_{k=1}^K π_k N(x | µ_k, Σ_k).

As ∫ p(x) dx = 1, it follows that Σ_{k=1}^K π_k = 1.
Let us write this with the help of a latent variable z.

Definition (Latent variables)
Latent variables (as opposed to observable variables) are variables that are not directly observed but are rather inferred (through a mathematical model) from other variables that are observed and directly measured. They are also sometimes called hidden variables, model parameters, or hypothetical variables.

Let z ∈ {0, 1}^K with Σ_{k=1}^K z_k = 1. In words, z is a K-dimensional vector in 1-of-K representation.
There are exactly K different possible vectors z, depending on which of the K entries is 1.
Define the joint distribution p(x, z) in terms of a marginal distribution p(z) and a conditional distribution p(x | z) as

    p(x, z) = p(z) p(x | z)

[Figure: graphical model with two nodes, z → x.]

Set the marginal distribution to p(z_k = 1) = π_k, where 0 ≤ π_k ≤ 1 together with Σ_{k=1}^K π_k = 1.
Because z uses 1-of-K coding, we can also write

    p(z) = Π_{k=1}^K π_k^{z_k}.

Set the conditional distribution of x given a particular z to p(x | z_k = 1) = N(x | µ_k, Σ_k), or

    p(x | z) = Π_{k=1}^K N(x | µ_k, Σ_k)^{z_k}.

The marginal distribution over x is now found by summing the joint distribution over all possible states of z:

    p(x) = Σ_z p(z) p(x | z)
         = Σ_z Π_{k=1}^K π_k^{z_k} N(x | µ_k, Σ_k)^{z_k}
         = Σ_{k=1}^K π_k N(x | µ_k, Σ_k)

The marginal distribution of x is a Gaussian mixture.
For several observations x_1, ..., x_N we need one latent variable z_n per observation.
What have we gained? We can now work with the joint distribution p(x, z).
This will lead to significant simplification later, especially for the EM algorithm.

Conditional probability of z given x, by Bayes' theorem:

    γ(z_k) = p(z_k = 1 | x) = p(z_k = 1) p(x | z_k = 1) / Σ_{j=1}^K p(z_j = 1) p(x | z_j = 1)
                            = π_k N(x | µ_k, Σ_k) / Σ_{j=1}^K π_j N(x | µ_j, Σ_j)

γ(z_k) is the responsibility of component k to explain the observation x.
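
A short numpy/SciPy sketch of this computation for N data points at once; the function and variable names are illustrative, with pi, mus and Sigmas holding the K mixing coefficients, means and covariances.

    import numpy as np
    from scipy.stats import multivariate_normal

    def responsibilities(X, pi, mus, Sigmas):
        # gamma(z_nk) = pi_k N(x_n | mu_k, Sigma_k) / sum_j pi_j N(x_n | mu_j, Sigma_j)
        pi = np.asarray(pi)
        dens = np.column_stack([multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k])
                                for k in range(len(pi))])   # (N, K) component densities
        num = pi * dens                                      # weight each column by pi_k
        return num / num.sum(axis=1, keepdims=True)          # normalise each row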

Gaussian Mixtures - Ancestral Sampling

Goal: Generate random samples distributed according to the mixture model.
1. Generate a sample ẑ from the distribution p(z).
2. Generate a value x from the conditional distribution p(x | ẑ).
Example: mixture of 3 Gaussians, 500 points.

[Figure, panels (a)-(c): (a) samples coloured by the original states of z; (b) samples from the marginal p(x); (c) colours (R, G, B) mixed according to the responsibilities γ(z_nk).]
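
A minimal numpy sketch of ancestral sampling from a K-component Gaussian mixture; names are illustrative, and the per-sample loop keeps the code short rather than fast.

    import numpy as np

    def sample_gmm(n, pi, mus, Sigmas, rng=None):
        # Ancestral sampling: first z ~ p(z), then x ~ p(x | z).
        rng = np.random.default_rng(rng)
        z = rng.choice(len(pi), size=n, p=pi)     # which component generated each point
        x = np.stack([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])
        return x, z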

Gaussian Mixtures - Maximum Likelihood

Given N data points, each of dimension D, we have the data matrix X ∈ R^{N×D}, where each row contains one data point.
Similarly, we have the matrix of latent variables Z ∈ R^{N×K} with rows z_n^T.
Assuming the data are drawn i.i.d., the distribution of the data can be represented by a graphical model.

[Figure: graphical model with a plate over n = 1, ..., N containing z_n → x_n, with parameters π, µ and Σ.]

Gaussian Mixtures - Maximum Likelihood

The log of the likelihood function is then

    ln p(X | π, µ, Σ) = Σ_{n=1}^N ln { Σ_{k=1}^K π_k N(x_n | µ_k, Σ_k) }

Significant problem: if a mean µ_j sits directly on a data point x_n, then

    N(x_n | x_n, σ_j² I) = 1 / ((2π)^{1/2} σ_j),

which diverges as σ_j → 0. Here we assumed Σ_k = σ_k² I, but the problem is general; just think of a principal axis transformation for Σ_k.
This is overfitting (in disguise) occurring again with the maximum likelihood approach.
Use heuristics to detect this situation and reset the mean of the corresponding component of the mixture.
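
A small numpy/SciPy sketch of this objective, using a log-sum-exp over components for numerical stability; the function name is illustrative.

    import numpy as np
    from scipy.stats import multivariate_normal
    from scipy.special import logsumexp

    def gmm_log_likelihood(X, pi, mus, Sigmas):
        # ln p(X | pi, mu, Sigma) = sum_n ln sum_k pi_k N(x_n | mu_k, Sigma_k)
        log_dens = np.column_stack(
            [np.log(pi[k]) + multivariate_normal.logpdf(X, mean=mus[k], cov=Sigmas[k])
             for k in range(len(pi))])              # (N, K) values of ln(pi_k N(...))
        return logsumexp(log_dens, axis=1).sum()    # stable log-sum-exp over components, summed over n

Placing one mean on a data point and shrinking its variance towards zero drives this value towards infinity, which is exactly the collapse described above.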

Gaussian Mixtures - Maximum Likelihood

A K-component mixture has a total of K! equivalent solutions, corresponding to the K! ways of assigning K sets of parameters to K components. This is also called the identifiability problem. It needs to be considered when the parameters discovered by a model are interpreted.
Maximising the log likelihood of a Gaussian mixture is more complex than for a single Gaussian: the summation over all K components inside the logarithm makes it harder. Setting the derivatives of the log likelihood to zero no longer results in a closed-form solution.
One may use gradient-based optimisation, or the EM algorithm. Stay tuned.