Unsupervised Learning

2018 EE448, Big Data Mining, Lecture 7: Unsupervised Learning. Weinan Zhang, Shanghai Jiao Tong University. http://wnzhang.net http://wnzhang.net/teaching/ee448/index.html

ML Problem Setting. Unsupervised learning: first build and learn $p(x)$, then infer the conditional dependence $p(x_t \mid x_i)$; each dimension of $x$ is treated equally. Supervised learning: directly learn the conditional dependence $p(x_t \mid x_i)$; $x_t$ is the label to predict.

Definition of Unsupervised Learning. Given the training dataset $D = \{x_i\}_{i=1,2,\dots,N}$, let the machine learn the underlying patterns of the data: latent variables $z \rightarrow x$; probability density function (p.d.f.) estimation $p(x)$; a good data representation (used for discrimination) $\phi(x)$.

Uses of Unsupervised Learning. Data structure discovery, data science; data compression; outlier detection; input to supervised/reinforcement algorithms (causes may be more simply related to outputs or rewards); a theory of biological learning and perception. Slide credit: Maneesh Sahani

Content. Fundamentals of Unsupervised Learning: K-means clustering; principal component analysis. Probabilistic Unsupervised Learning: mixture of Gaussians; EM methods.

K-Means Clustering

K-Means Clustering. Provide the number of desired clusters, k. Randomly choose k instances as seeds, one per cluster, i.e. the initial centroid of each cluster. Iterate: assign each instance to the cluster with the closest centroid; re-estimate the centroid of each cluster. Stop when the clustering converges, or after a fixed number of iterations. Slide credit: Ray Mooney
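
A minimal NumPy sketch of the loop just described (illustrative only; the function name `kmeans` and its defaults are my own, not from the lecture):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain K-means on an (n, d) data matrix X with k clusters."""
    rng = np.random.default_rng(seed)
    # Randomly choose k instances as the initial centroids (seeds).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each instance goes to the cluster with the closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: re-estimate each centroid as the mean of its cluster.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # assignments no longer change the centroids: converged
        centroids = new_centroids
    return centroids, labels
```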

K-Means Clustering: Centroid. Assume instances are real-valued vectors $x \in \mathbb{R}^d$. Clusters are based on centroids, the center of gravity or mean of the points in a cluster $C_k$: $\mu_k = \frac{1}{|C_k|} \sum_{x \in C_k} x$. Slide credit: Ray Mooney

K-Means Clustering: Distance. Distance of a point to a centroid, $L(x, \mu_k)$:
Euclidean distance (L2 norm): $L_2(x, \mu_k) = \|x - \mu_k\| = \sqrt{\sum_{m=1}^{d} (x_m - \mu_{km})^2}$
L1 (Manhattan) distance: $L_1(x, \mu_k) = |x - \mu_k| = \sum_{m=1}^{d} |x_m - \mu_{km}|$
Cosine distance: $L_{\cos}(x, \mu_k) = 1 - \frac{x^\top \mu_k}{|x|\,|\mu_k|}$
Slide credit: Ray Mooney
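
For concreteness, the three distances can be written in NumPy as follows (a small illustrative sketch, not lecture code):

```python
import numpy as np

def l2_distance(x, mu):
    # Euclidean distance: square root of the sum of squared coordinate differences.
    return np.sqrt(np.sum((x - mu) ** 2))

def l1_distance(x, mu):
    # L1 (Manhattan) distance: sum of absolute coordinate differences.
    return np.sum(np.abs(x - mu))

def cosine_distance(x, mu):
    # One minus the cosine of the angle between x and mu.
    return 1.0 - np.dot(x, mu) / (np.linalg.norm(x) * np.linalg.norm(mu))
```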

K-Means Example (K=2). Pick seeds; reassign clusters; compute centroids; reassign clusters; compute centroids; reassign clusters; converged! Slide credit: Ray Mooney

K-Means Time Complexity. Assume computing the distance between two instances is O(d), where d is the dimensionality of the vectors. Reassigning clusters: O(knd) distance computations. Computing centroids: each instance vector gets added once to some centroid, O(nd). Assume these two steps are each done once for I iterations: O(Iknd). Slide credit: Ray Mooney

K-Means Clustering Objective. The objective of K-means is to minimize the total sum of the squared distances of every point to its corresponding cluster centroid:
$\min_{\{\mu_k\}_{k=1}^{K}} \sum_{k=1}^{K} \sum_{x \in C_k} L(x, \mu_k)$, with $\mu_k = \frac{1}{|C_k|} \sum_{x \in C_k} x$.
Finding the global optimum is NP-hard. The K-means algorithm is guaranteed to converge to a local optimum.

Seed Choice Results can vary based on random seed selection. Some seeds can result in poor convergence rate, or convergence to sub-optimal clusterings. Select good seeds using a heuristic or the results of another method.
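
One common practical remedy, not prescribed by the slide, is to run K-means from several random seeds and keep the run with the lowest objective, optionally combined with a seeding heuristic such as k-means++. scikit-learn exposes both, for example:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.RandomState(0).rand(200, 2)  # toy data for illustration

# k-means++ seeding plus 10 random restarts; scikit-learn keeps the fit
# with the lowest within-cluster sum of squares (inertia_).
km = KMeans(n_clusters=5, init="k-means++", n_init=10, random_state=0)
labels = km.fit_predict(X)
print(km.inertia_)  # the K-means objective of the best run
```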

Clustering Applications. Text mining: cluster documents for related search; cluster words for query suggestion. Recommender systems and advertising: cluster users for item/ad recommendation; cluster items for related item suggestion. Image search: cluster images for similar image search and duplication detection. Speech recognition or separation: cluster phonetic features.

Principal Component Analysis (PCA). An example of 2-dimensional data: $x_1$ is the piloting skill of a pilot, $x_2$ is how much he/she enjoys flying. Main components: $u_1$ is the intrinsic piloting karma of a person, $u_2$ is some noise. Example credit: Andrew Ng

Principal Component Analysis (PCA). PCA tries to identify the subspace in which the data approximately lies. PCA uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the smaller of the number of original variables or the number of observations: $\mathbb{R}^d \rightarrow \mathbb{R}^k$, $k \le d$.

PCA Data Preprocessing. Given the dataset $D = \{x^{(i)}\}_{i=1}^{m}$, typically we first pre-process the data to normalize its mean and variance:
1. Move the center of the data set to 0: $\mu = \frac{1}{m}\sum_{i=1}^{m} x^{(i)}$, then $x^{(i)} \leftarrow x^{(i)} - \mu$.
2. Unify the variance of each variable: $\sigma_j^2 = \frac{1}{m}\sum_{i=1}^{m} (x_j^{(i)})^2$, then $x_j^{(i)} \leftarrow x_j^{(i)} / \sigma_j$.

PCA Data Preprocessing. Zero out the mean of the data. Rescale each coordinate to have unit variance, which ensures that different attributes are all treated on the same scale.
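
A direct NumPy translation of these two preprocessing steps (my own sketch; assumes an (m, d) data matrix):

```python
import numpy as np

def preprocess(X):
    """Zero-mean the data and rescale each coordinate to unit variance."""
    X = X - X.mean(axis=0)                    # step 1: move the data center to 0
    sigma = np.sqrt((X ** 2).mean(axis=0))    # per-coordinate standard deviation
    sigma = np.where(sigma == 0, 1.0, sigma)  # guard against constant coordinates
    return X / sigma                          # step 2: unit variance per coordinate
```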

PCA Solution. PCA finds the directions of largest variance in the data, which correspond to the eigenvectors of the matrix $X^\top X$ with the largest eigenvalues.

PCA Solution: Data Projection. The projection of each point $x^{(i)}$ onto a direction $u$ (with $\|u\| = 1$) is $x^{(i)\top} u$. The variance of the projection is
$\frac{1}{m}\sum_{i=1}^{m} (x^{(i)\top} u)^2 = \frac{1}{m}\sum_{i=1}^{m} u^\top x^{(i)} x^{(i)\top} u = u^\top \Big( \frac{1}{m}\sum_{i=1}^{m} x^{(i)} x^{(i)\top} \Big) u = u^\top \Sigma u$.

PCA Solution: Largest Eigenvalues. Solve $\max_{u} u^\top \Sigma u$ s.t. $\|u\| = 1$, where $\Sigma = \frac{1}{m}\sum_{i=1}^{m} x^{(i)} x^{(i)\top}$. Finding the k principal components of the data means finding the k principal eigenvectors of $\Sigma$, i.e. the top-k eigenvectors with the largest eigenvalues. The projected vector for $x^{(i)}$ is $y^{(i)} = [u_1^\top x^{(i)},\, u_2^\top x^{(i)},\, \dots,\, u_k^\top x^{(i)}]^\top \in \mathbb{R}^k$.
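
Putting the pieces together, a compact sketch of PCA via eigendecomposition of the covariance matrix (the function name `pca_topk` is mine, not from the lecture):

```python
import numpy as np

def pca_topk(X, k):
    """Project the (m, d) data X onto its top-k principal components."""
    X = X - X.mean(axis=0)            # assume mean-centering as in preprocessing
    sigma = X.T @ X / len(X)          # covariance matrix Sigma, shape (d, d)
    w, U = np.linalg.eigh(sigma)      # eigenvalues ascending, eigenvectors as columns
    idx = np.argsort(w)[::-1][:k]     # indices of the k largest eigenvalues
    U_k = U[:, idx]                   # top-k eigenvectors u_1, ..., u_k
    return X @ U_k                    # rows are y^(i) = [u_1^T x^(i), ..., u_k^T x^(i)]
```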

Eigendecomposition Revisited. For a positive semi-definite square matrix $\Sigma \in \mathbb{R}^{d \times d}$, suppose $u$ (with $\|u\| = 1$) is an eigenvector with scalar eigenvalue $w$: $\Sigma u = w u$. There are d eigenvector-eigenvalue pairs $(u_i, w_i)$. These d eigenvectors are orthogonal, so they form an orthonormal basis: $\sum_{i=1}^{d} u_i u_i^\top = I$. Thus any vector $v$ can be written as $v = \big(\sum_{i=1}^{d} u_i u_i^\top\big) v = \sum_{i=1}^{d} (u_i^\top v)\, u_i = \sum_{i=1}^{d} v^{(i)} u_i$. $\Sigma$ can be written as $\Sigma = \sum_{i=1}^{d} w_i u_i u_i^\top = U W U^\top$, where $U = [u_1, u_2, \dots, u_d]$ and $W = \mathrm{diag}(w_1, w_2, \dots, w_d)$.

Eigendecomposition Revisited. Given the data $X = [x_1^\top; x_2^\top; \dots; x_n^\top]$, its covariance matrix is $\Sigma = X^\top X$ (here we may drop the $1/m$ factor for simplicity). The variance in direction $u_i$ is $\|X u_i\|^2 = u_i^\top X^\top X u_i = u_i^\top \Sigma u_i = u_i^\top w_i u_i = w_i$. The variance in any direction $v$ is $\|X v\|^2 = \big\|X \sum_i v^{(i)} u_i\big\|^2 = \sum_{ij} v^{(i)} u_i^\top \Sigma u_j v^{(j)} = \sum_{i=1}^{d} (v^{(i)})^2 w_i$, where $v^{(i)}$ is the projection length of $v$ on $u_i$. If $v^\top v = 1$, then $\arg\max_{\|v\|=1} \|Xv\|^2 = u^{(\max)}$: the direction of greatest variance is the eigenvector with the largest eigenvalue.
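
A quick numerical check of these claims on random data (illustrative only): the variance along each eigenvector equals its eigenvalue, and no unit direction exceeds the largest eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
X -= X.mean(axis=0)
sigma = X.T @ X                 # drop the 1/m factor, as on the slide
w, U = np.linalg.eigh(sigma)    # eigenvalues w_i, eigenvectors u_i (columns of U)

# Variance along each eigenvector equals its eigenvalue: ||X u_i||^2 = w_i.
for i in range(X.shape[1]):
    assert np.isclose(np.linalg.norm(X @ U[:, i]) ** 2, w[i])

# A random unit direction never achieves more variance than the top eigenvector.
v = rng.normal(size=X.shape[1])
v /= np.linalg.norm(v)
assert np.linalg.norm(X @ v) ** 2 <= w.max() + 1e-8
```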

PCA Discussion PCA can also be derived by picking the basis that minimizes the approximation error arising from projecting the data onto the k-dimensional subspace spanned by them.

PCA Visualization http://setosa.io/ev/principal-component-analysis/

Content. Fundamentals of Unsupervised Learning: K-means clustering; principal component analysis. Probabilistic Unsupervised Learning: mixture of Gaussians; EM methods.

Mixture Gaussian

Graphical Model for Mixture of Gaussians. Given a training set $\{x^{(1)}, x^{(2)}, \dots, x^{(m)}\}$, model the data by specifying a joint distribution $p(x^{(i)}, z^{(i)}) = p(x^{(i)} \mid z^{(i)})\, p(z^{(i)})$, where $z^{(i)} \sim \mathrm{Multinomial}(\phi)$ with $p(z^{(i)} = j) = \phi_j$, and $x^{(i)} \mid z^{(i)} = j \sim \mathcal{N}(\mu_j, \Sigma_j)$. Here $\phi$ gives the parameters of the latent variable distribution; the latent variable $z$ is the Gaussian cluster ID, indicating which Gaussian each $x$ comes from; the $x$ are the observed data points.
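
The generative process can be simulated directly; a small sketch with made-up parameters (two Gaussians in 2-D):

```python
import numpy as np

rng = np.random.default_rng(0)
phi = np.array([0.3, 0.7])                    # mixing weights phi_j, sum to 1
mu = np.array([[0.0, 0.0], [4.0, 4.0]])       # component means mu_j
cov = np.array([np.eye(2), 0.5 * np.eye(2)])  # component covariances Sigma_j

m = 1000
z = rng.choice(len(phi), size=m, p=phi)       # z^(i) ~ Multinomial(phi)
x = np.array([rng.multivariate_normal(mu[j], cov[j]) for j in z])  # x^(i) ~ N(mu_j, Sigma_j)
```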

Data Likelihood. We want to maximize
$\ell(\phi, \mu, \Sigma) = \sum_{i=1}^{m} \log p(x^{(i)}; \phi, \mu, \Sigma) = \sum_{i=1}^{m} \log \sum_{z^{(i)}=1}^{k} p(x^{(i)} \mid z^{(i)}; \mu, \Sigma)\, p(z^{(i)}; \phi) = \sum_{i=1}^{m} \log \sum_{j=1}^{k} \mathcal{N}(x^{(i)} \mid \mu_j, \Sigma_j)\, \phi_j$.
There is no closed-form solution obtained by simply setting $\frac{\partial \ell(\phi, \mu, \Sigma)}{\partial \phi} = 0$, $\frac{\partial \ell(\phi, \mu, \Sigma)}{\partial \mu} = 0$, $\frac{\partial \ell(\phi, \mu, \Sigma)}{\partial \Sigma} = 0$.

Data Likelihood Maximization. For each data point $x^{(i)}$, the latent variable $z^{(i)}$ indicates which Gaussian it comes from. If we knew $z^{(i)}$, the data likelihood would be
$\ell(\phi, \mu, \Sigma) = \sum_{i=1}^{m} \log p(x^{(i)}; \phi, \mu, \Sigma) = \sum_{i=1}^{m} \log p(x^{(i)} \mid z^{(i)}; \mu, \Sigma)\, p(z^{(i)}; \phi) = \sum_{i=1}^{m} \big[ \log \mathcal{N}(x^{(i)} \mid \mu_{z^{(i)}}, \Sigma_{z^{(i)}}) + \log p(z^{(i)}; \phi) \big]$.

Data Likelihood Maximization. Given $z^{(i)}$, maximize the data likelihood
$\max_{\phi, \mu, \Sigma} \ell(\phi, \mu, \Sigma) = \max_{\phi, \mu, \Sigma} \sum_{i=1}^{m} \big[ \log \mathcal{N}(x^{(i)} \mid \mu_{z^{(i)}}, \Sigma_{z^{(i)}}) + \log p(z^{(i)}; \phi) \big]$.
It is easy to get the solution:
$\phi_j = \frac{1}{m} \sum_{i=1}^{m} 1\{z^{(i)} = j\}$, $\quad \mu_j = \frac{\sum_{i=1}^{m} 1\{z^{(i)} = j\}\, x^{(i)}}{\sum_{i=1}^{m} 1\{z^{(i)} = j\}}$, $\quad \Sigma_j = \frac{\sum_{i=1}^{m} 1\{z^{(i)} = j\}\, (x^{(i)} - \mu_j)(x^{(i)} - \mu_j)^\top}{\sum_{i=1}^{m} 1\{z^{(i)} = j\}}$.
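
When the labels z are observed, these closed-form estimates are straightforward to code (my own sketch; assumes every component label 0..k-1 appears at least once):

```python
import numpy as np

def mle_given_z(X, z, k):
    """MLE of (phi, mu, Sigma) when the component label z of each point is known."""
    phi = np.array([(z == j).mean() for j in range(k)])
    mu = np.array([X[z == j].mean(axis=0) for j in range(k)])
    Sigma = []
    for j in range(k):
        diff = X[z == j] - mu[j]
        Sigma.append(diff.T @ diff / (z == j).sum())  # per-component covariance
    return phi, mu, np.array(Sigma)
```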

Latent Variable Inference. Given the parameters $\phi, \mu, \Sigma$, it is not hard to infer the posterior of the latent variable $z^{(i)}$ for each instance:
$p(z^{(i)} = j \mid x^{(i)}; \phi, \mu, \Sigma) = \frac{p(z^{(i)} = j, x^{(i)}; \phi, \mu, \Sigma)}{p(x^{(i)}; \phi, \mu, \Sigma)} = \frac{p(x^{(i)} \mid z^{(i)} = j; \mu, \Sigma)\, p(z^{(i)} = j; \phi)}{\sum_{l=1}^{k} p(x^{(i)} \mid z^{(i)} = l; \mu, \Sigma)\, p(z^{(i)} = l; \phi)}$,
where the prior of $z^{(i)}$ is $p(z^{(i)} = j; \phi)$ and the likelihood is $p(x^{(i)} \mid z^{(i)} = j; \mu, \Sigma)$.

Expectation Maximization Methods. E-step: infer the posterior distribution of the latent variables given the model parameters. M-step: tune the parameters to maximize the data likelihood given the latent variable distribution. EM methods iteratively execute the E-step and M-step until convergence.

EM Methods for Mixture of Gaussians. Repeat until convergence: {
(E-step) For each i, j, set $w_j^{(i)} = p(z^{(i)} = j \mid x^{(i)}; \phi, \mu, \Sigma)$.
(M-step) Update the parameters:
$\phi_j = \frac{1}{m} \sum_{i=1}^{m} w_j^{(i)}$, $\quad \mu_j = \frac{\sum_{i=1}^{m} w_j^{(i)} x^{(i)}}{\sum_{i=1}^{m} w_j^{(i)}}$, $\quad \Sigma_j = \frac{\sum_{i=1}^{m} w_j^{(i)} (x^{(i)} - \mu_j)(x^{(i)} - \mu_j)^\top}{\sum_{i=1}^{m} w_j^{(i)}}$.
}
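
A compact EM loop implementing these updates for a Gaussian mixture (a sketch, not the lecture's code; SciPy is used for the Gaussian density, and the small ridge added to each covariance is my own numerical-stability choice):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, n_iters=100, seed=0):
    """EM for a k-component Gaussian mixture on the (m, d) data X."""
    m, d = X.shape
    rng = np.random.default_rng(seed)
    phi = np.full(k, 1.0 / k)
    mu = X[rng.choice(m, size=k, replace=False)]  # initialize means at random points
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])
    for _ in range(n_iters):
        # E-step: responsibilities w[i, j] = p(z^(i) = j | x^(i); phi, mu, Sigma).
        w = np.array([phi[j] * multivariate_normal.pdf(X, mu[j], Sigma[j])
                      for j in range(k)]).T        # shape (m, k)
        w /= w.sum(axis=1, keepdims=True)
        # M-step: re-estimate the parameters from the responsibilities.
        Nj = w.sum(axis=0)                         # effective counts per component
        phi = Nj / m
        mu = (w.T @ X) / Nj[:, None]
        Sigma = np.array([
            ((X - mu[j]).T * w[:, j]) @ (X - mu[j]) / Nj[j] + 1e-6 * np.eye(d)
            for j in range(k)
        ])
    return phi, mu, Sigma, w
```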

General EM Methods. Claims: 1. After each EM step, the data likelihood will not decrease. 2. The EM algorithm finds a (local) maximum of a latent variable model's likelihood. Now let's discuss the general EM method and verify that it improves the data likelihood and converges.

Jensen's Inequality. Theorem. Let f be a convex function, and let X be a random variable. Then: $E[f(X)] \ge f(E[X])$. Moreover, if f is strictly convex, then $E[f(X)] = f(E[X])$ holds if and only if $X = E[X]$ with probability 1 (i.e., X is a constant).

Jensen's Inequality: $E[f(X)] \ge f(E[X])$. Figure credit: Andrew Ng

Jensen's Inequality. Figure credit: Maneesh Sahani

General EM Methods: Problem. Given the training dataset $D = \{x_i\}_{i=1,2,\dots,N}$, let the machine learn the underlying patterns of the data. Assume latent variables $z \rightarrow x$. We wish to fit the parameters of a model $p(x, z)$ to the data, where the log-likelihood is
$\ell(\theta) = \sum_{i=1}^{N} \log p(x^{(i)}; \theta) = \sum_{i=1}^{N} \log \sum_{z} p(x^{(i)}, z; \theta)$.

General EM Methods: Problems. EM methods solve problems where explicitly finding the maximum likelihood estimate (MLE) is hard:
$\theta = \arg\max_{\theta} \sum_{i=1}^{N} \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta)$,
but given $z^{(i)}$ observed, the MLE is easy:
$\theta = \arg\max_{\theta} \sum_{i=1}^{N} \log p(x^{(i)} \mid z^{(i)}; \theta)$.
EM methods give an efficient solution for the MLE by iteratively doing: E-step, construct a (good) lower bound of the log-likelihood; M-step, optimize that lower bound.

General EM Methods: Lower Bound. For each instance i, let $q_i$ be some distribution over $z^{(i)}$: $\sum_{z} q_i(z) = 1$, $q_i(z) \ge 0$. Thus the data log-likelihood
$\ell(\theta) = \sum_{i=1}^{N} \log p(x^{(i)}; \theta) = \sum_{i=1}^{N} \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta) = \sum_{i=1}^{N} \log \sum_{z^{(i)}} q_i(z^{(i)}) \frac{p(x^{(i)}, z^{(i)}; \theta)}{q_i(z^{(i)})} \ge \sum_{i=1}^{N} \sum_{z^{(i)}} q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{q_i(z^{(i)})}$
by Jensen's inequality, since $-\log(x)$ is a convex function. The right-hand side is a lower bound of $\ell(\theta)$.

General EM Methods: Lower Bound.
$\ell(\theta) = \sum_{i=1}^{N} \log p(x^{(i)}; \theta) \ge \sum_{i=1}^{N} \sum_{z^{(i)}} q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{q_i(z^{(i)})}$
Then what $q_i(z)$ should we choose?

REVIEW: Jensen's Inequality. Theorem. Let f be a convex function, and let X be a random variable. Then: $E[f(X)] \ge f(E[X])$. Moreover, if f is strictly convex, then $E[f(X)] = f(E[X])$ holds if and only if $X = E[X]$ with probability 1 (i.e., X is a constant).

General EM Methods: Lower Bound.
$\ell(\theta) = \sum_{i=1}^{N} \log p(x^{(i)}; \theta) \ge \sum_{i=1}^{N} \sum_{z^{(i)}} q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{q_i(z^{(i)})}$
Then what $q_i(z)$ should we choose? In order to make the above inequality tight (to hold with equality), it is sufficient that $\frac{p(x^{(i)}, z^{(i)}; \theta)}{q_i(z^{(i)})} = c$ for some constant c. We can then derive
$\log p(x^{(i)}; \theta) = \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta) = \log \sum_{z^{(i)}} q_i(z^{(i)})\, c = \sum_{z^{(i)}} q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{q_i(z^{(i)})}$.
As such, $q_i(z)$ is the posterior distribution:
$q_i(z^{(i)}) = \frac{p(x^{(i)}, z^{(i)}; \theta)}{\sum_{z} p(x^{(i)}, z; \theta)} = \frac{p(x^{(i)}, z^{(i)}; \theta)}{p(x^{(i)}; \theta)} = p(z^{(i)} \mid x^{(i)}; \theta)$.

General EM Methods. Repeat until convergence: {
(E-step) For each i, set $q_i(z^{(i)}) = p(z^{(i)} \mid x^{(i)}; \theta)$.
(M-step) Update the parameters $\theta = \arg\max_{\theta} \sum_{i=1}^{N} \sum_{z^{(i)}} q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{q_i(z^{(i)})}$.
}

Convergence of EM. Denote $\theta^{(t)}$ and $\theta^{(t+1)}$ as the parameters of two successive iterations of EM. We prove that $\ell(\theta^{(t)}) \le \ell(\theta^{(t+1)})$, which shows EM always monotonically improves the log-likelihood and thus ensures that EM will at least converge to a local optimum.

Proof of EM Convergence. Starting from $\theta^{(t)}$, we choose the posterior of the latent variable, $q_i^{(t)}(z^{(i)}) = p(z^{(i)} \mid x^{(i)}; \theta^{(t)})$. This choice ensures that Jensen's inequality holds with equality:
$\ell(\theta^{(t)}) = \sum_{i=1}^{N} \log \sum_{z^{(i)}} q_i^{(t)}(z^{(i)}) \frac{p(x^{(i)}, z^{(i)}; \theta^{(t)})}{q_i^{(t)}(z^{(i)})} = \sum_{i=1}^{N} \sum_{z^{(i)}} q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta^{(t)})}{q_i^{(t)}(z^{(i)})}$.
The parameters $\theta^{(t+1)}$ are then obtained by maximizing the right-hand side of the above equation. Thus
$\ell(\theta^{(t+1)}) \ge \sum_{i=1}^{N} \sum_{z^{(i)}} q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta^{(t+1)})}{q_i^{(t)}(z^{(i)})}$  [lower bound]
$\ge \sum_{i=1}^{N} \sum_{z^{(i)}} q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta^{(t)})}{q_i^{(t)}(z^{(i)})} = \ell(\theta^{(t)})$.  [parameter optimization]

Remark on EM Convergence. If we define
$J(q, \theta) = \sum_{i=1}^{N} \sum_{z^{(i)}} q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{q_i(z^{(i)})}$,
then we know $\ell(\theta) \ge J(q, \theta)$. EM can also be viewed as coordinate ascent on J: the E-step maximizes it w.r.t. q, and the M-step maximizes it w.r.t. $\theta$.
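
The bound $\ell(\theta) \ge J(q, \theta)$, with equality exactly when q is the posterior, can be checked numerically on a tiny discrete model (illustrative only; the joint table below is made up):

```python
import numpy as np

# A made-up joint distribution p(x, z) over x in {0, 1} and z in {0, 1, 2}.
p_xz = np.array([[0.10, 0.25, 0.05],
                 [0.20, 0.10, 0.30]])
x = 1                                  # a single observed data point
p_x = p_xz[x].sum()                    # p(x) = sum_z p(x, z)
log_lik = np.log(p_x)                  # l(theta) for this observation

def J(q):
    # Lower bound: sum_z q(z) * log( p(x, z) / q(z) ).
    return np.sum(q * np.log(p_xz[x] / q))

q_posterior = p_xz[x] / p_x            # q(z) = p(z | x)
q_uniform = np.ones(3) / 3

assert np.isclose(J(q_posterior), log_lik)  # the bound is tight at the posterior
assert J(q_uniform) <= log_lik + 1e-12      # any other q still lower-bounds the log-likelihood
```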

Coordinate Ascent in EM. E-step: $q_i^{(t)}(z^{(i)}) = p(z^{(i)} \mid x^{(i)}; \theta^{(t)})$. M-step: $\theta^{(t+1)} = \arg\max_{\theta} \sum_{i=1}^{N} \sum_{z^{(i)}} q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{q_i^{(t)}(z^{(i)})}$. Figure credit: Maneesh Sahani