CS6220: DATA MINING TECHNIQUES


Matrix Data: Clustering: Part 2
Instructor: Yizhou Sun (yzsun@ccs.neu.edu)
November 3, 2015

Methods to Learn (by data type)
- Classification: Decision Tree; Naïve Bayes; Logistic Regression; SVM; kNN (matrix data) | HMM (sequence data) | Label Propagation* (graph & network) | Neural Network (images)
- Clustering: K-means; hierarchical clustering; DBSCAN; Mixture Models; kernel k-means* (matrix data) | PLSA (text data) | SCAN*; Spectral Clustering* (graph & network)
- Frequent Pattern Mining: Apriori; FP-growth (set data) | GSP; PrefixSpan (sequence data)
- Prediction: Linear Regression (matrix data) | Autoregression (time series)
- Similarity Search: DTW (time series) | P-PageRank (graph & network)
- Ranking: PageRank (graph & network)

Outline (Matrix Data: Clustering: Part 2)
- Revisit K-means
- Mixture Model and EM Algorithm
- Kernel K-means
- Summary

Recall K-Means: Objective Function
$J = \sum_{j=1}^{k} \sum_{i: C(i)=j} \|x_i - c_j\|^2$ (total within-cluster variance)
Re-arrange the objective function:
$J = \sum_{j=1}^{k} \sum_{i} w_{ij} \|x_i - c_j\|^2$, where $w_{ij} \in \{0, 1\}$; $w_{ij} = 1$ if $x_i$ belongs to cluster $j$, and $w_{ij} = 0$ otherwise.
Looking for: the best assignment $w_{ij}$ and the best centers $c_j$.

Solution of K-Means: Iterations
Step 1: Fix the centers $c_j$ and find the assignment $w_{ij}$ that minimizes $J$: set $w_{ij} = 1$ if $\|x_i - c_j\|^2$ is the smallest over $j$.
Step 2: Fix the assignment $w_{ij}$ and find the centers that minimize $J = \sum_{j=1}^{k} \sum_i w_{ij} \|x_i - c_j\|^2$: set the first derivative to zero, i.e., $\partial J / \partial c_j = -2 \sum_i w_{ij}(x_i - c_j) = 0$, which gives $c_j = \frac{\sum_i w_{ij} x_i}{\sum_i w_{ij}}$.
Note: $\sum_i w_{ij}$ is the total number of objects in cluster $j$.
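The two alternating steps above translate directly into code. Below is a minimal NumPy sketch of these iterations (function and variable names are illustrative, not from the slides):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-means: alternate the assignment step and the center-update step."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random initial centers
    for _ in range(n_iters):
        # Step 1: fix centers, assign each x_i to its closest center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # n x k squared distances
        labels = d2.argmin(axis=1)
        # Step 2: fix assignments, recompute each center as its cluster mean
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break  # J can no longer decrease
        centers = new_centers
    return labels, centers
```

For example, `labels, centers = kmeans(np.random.randn(200, 2), k=3)` clusters 200 random 2-d points into three groups.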

Converges! Why? Each of the two steps can only decrease (or keep) $J$, $J$ is bounded below by 0, and there are only finitely many possible assignments, so the iterations must converge (possibly to a local optimum).

Limitations of K-Means
K-means has problems when clusters are of different sizes, different densities, or non-spherical shapes.

Limitations of K-Means: Different Density and Size (figure)

Limitations of K-Means: Non-Spherical Shapes (figure)

Demo: http://webdocs.cs.ualberta.ca/~yaling/cluster/applet/code/cluster.html

Connections of K-Means to Other Methods
K-means is connected to both the Gaussian Mixture Model and Kernel K-means, as detailed in the following sections.

Next: Mixture Model and EM Algorithm

Fuzzy Set and Fuzzy Cluster
The clustering methods discussed so far assign every data object to exactly one cluster. Some applications need fuzzy or soft cluster assignment; e.g., an e-game could belong to both entertainment and software. Two families of methods support this: fuzzy clusters and probabilistic model-based clusters. A fuzzy cluster is a fuzzy set $S$ with a membership function $F_S: X \to [0, 1]$ (a value between 0 and 1).

Mixture Model-Based Clustering
A set $C$ of $k$ probabilistic clusters $C_1, \dots, C_k$, with probability density functions $f_1, \dots, f_k$ and cluster prior probabilities $w_1, \dots, w_k$, where $\sum_j w_j = 1$.
The probability that an object $x_i$ is generated by cluster $C_j$: $P(x_i, z_i = C_j) = w_j f_j(x_i)$.
The probability that $x_i$ is generated by the set of clusters $C$: $P(x_i) = \sum_j w_j f_j(x_i)$.

Maximum Likelihood Estimation
Since objects are assumed to be generated independently, for a data set $D = \{x_1, \dots, x_n\}$ we have
$P(D) = \prod_i P(x_i) = \prod_i \sum_j w_j f_j(x_i)$.
Task: find a set $C$ of $k$ probabilistic clusters such that $P(D)$ is maximized.

The EM (Expectation Maximization) Algorithm
The EM algorithm is a framework for computing maximum likelihood or maximum a posteriori estimates of parameters in statistical models with latent variables.
E-step: assign objects to clusters according to the current fuzzy clustering or current parameters of the probabilistic clusters:
$w_{ij}^t = p(z_i = j \mid \theta_j^t, x_i) \propto p(x_i \mid C_j^t, \theta_j^t)\, p(C_j^t)$
M-step: find the new clustering or new parameters that maximize the expected likelihood.

Gaussian Mixture Model
Generative model: for each object,
- pick its distribution component: $Z \sim \mathrm{Multinomial}(w_1, \dots, w_k)$;
- sample a value from the selected distribution: $X \sim N(\mu_Z, \sigma_Z^2)$.
Overall likelihood function: $L(D \mid \theta) = \prod_i \sum_j w_j\, p(x_i \mid \mu_j, \sigma_j^2)$
Q: What is $\theta$ here? (It collects all the unknown parameters: $\{w_j, \mu_j, \sigma_j^2\}_{j=1}^k$.)
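As a concrete illustration of this generative story, here is a small NumPy sketch that samples from a 1-d Gaussian mixture and evaluates the likelihood; the particular weights, means, and variances are made-up values for demonstration:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
w = np.array([0.5, 0.3, 0.2])        # mixing weights w_j, sum to 1
mu = np.array([-2.0, 0.0, 3.0])      # component means mu_j
sigma = np.array([0.5, 1.0, 0.8])    # component standard deviations sigma_j

n = 1000
z = rng.choice(len(w), size=n, p=w)  # Z ~ Multinomial(w_1, ..., w_k)
x = rng.normal(mu[z], sigma[z])      # X ~ N(mu_Z, sigma_Z^2)

# Log-likelihood of the sample under the true parameters:
log_lik = np.log((w * norm.pdf(x[:, None], mu, sigma)).sum(axis=1)).sum()
print(log_lik)
```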

Estimating Parameters
$\ell(D; \theta) = \sum_i \log \sum_j w_j\, p(x_i \mid \mu_j, \sigma_j^2)$
Consider the first derivative with respect to $\mu_j$:
$\frac{\partial \ell}{\partial \mu_j} = \sum_i \frac{w_j}{\sum_{j'} w_{j'}\, p(x_i \mid \mu_{j'}, \sigma_{j'}^2)} \cdot \frac{\partial p(x_i \mid \mu_j, \sigma_j^2)}{\partial \mu_j}$
$= \sum_i \frac{w_j\, p(x_i \mid \mu_j, \sigma_j^2)}{\sum_{j'} w_{j'}\, p(x_i \mid \mu_{j'}, \sigma_{j'}^2)} \cdot \frac{1}{p(x_i \mid \mu_j, \sigma_j^2)} \cdot \frac{\partial p(x_i \mid \mu_j, \sigma_j^2)}{\partial \mu_j}$
$= \sum_i \frac{w_j\, p(x_i \mid \mu_j, \sigma_j^2)}{\sum_{j'} w_{j'}\, p(x_i \mid \mu_{j'}, \sigma_{j'}^2)} \cdot \frac{\partial \log p(x_i \mid \mu_j, \sigma_j^2)}{\partial \mu_j} = \sum_i w_{ij}\, \frac{\partial \ell(x_i)}{\partial \mu_j}$,
where $w_{ij} = P(Z = j \mid X = x_i, \theta)$.
Setting this to zero directly is intractable: it looks like weighted likelihood estimation, but the weights $w_{ij}$ are themselves determined by the parameters!

Apply the EM Algorithm: 1-d
An iterative algorithm (at iteration $t+1$):
E(expectation)-step: evaluate the weights $w_{ij}$ given the current $\mu_j, \sigma_j, w_j$:
$w_{ij}^t = \frac{w_j^t\, p(x_i \mid \mu_j^t, (\sigma_j^2)^t)}{\sum_{j'} w_{j'}^t\, p(x_i \mid \mu_{j'}^t, (\sigma_{j'}^2)^t)}$
M(maximization)-step: evaluate the $\mu_j, \sigma_j, w_j$ that maximize the weighted likelihood given the $w_{ij}$'s. This is equivalent to Gaussian parameter estimation where each point carries a weight for each component:
$\mu_j^{t+1} = \frac{\sum_i w_{ij}^t x_i}{\sum_i w_{ij}^t}; \quad (\sigma_j^2)^{t+1} = \frac{\sum_i w_{ij}^t (x_i - \mu_j^{t+1})^2}{\sum_i w_{ij}^t}; \quad w_j^{t+1} \propto \sum_i w_{ij}^t$
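A minimal NumPy sketch of these 1-d EM updates follows; the initialization and stopping rule are simplified assumptions, and the names are illustrative:

```python
import numpy as np
from scipy.stats import norm

def em_gmm_1d(x, k, n_iters=100, seed=0):
    """EM for a 1-d Gaussian mixture: alternate the E-step and M-step above."""
    rng = np.random.default_rng(seed)
    n = len(x)
    w = np.full(k, 1.0 / k)                    # mixing weights w_j
    mu = rng.choice(x, size=k, replace=False)  # initialize means from data points
    var = np.full(k, x.var())                  # initialize variances
    for _ in range(n_iters):
        # E-step: responsibilities w_ij (n x k), normalized over components
        dens = w * norm.pdf(x[:, None], mu, np.sqrt(var))  # w_j * p(x_i | mu_j, sigma_j^2)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted maximum-likelihood updates
        nj = resp.sum(axis=0)                  # effective cluster sizes
        mu = (resp * x[:, None]).sum(axis=0) / nj
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nj
        w = nj / n
    return w, mu, var
```

Fitting it to the sample generated earlier, e.g. `w_hat, mu_hat, var_hat = em_gmm_1d(x, k=3)`, should recover parameters close to the ones used for sampling.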

Example: 1-D GMM (figure)

2-d Gaussian: Bivariate Gaussian Distribution
Two-dimensional random variable $X = (X_1, X_2)^T$:
$X \sim N\!\left(\mu = \begin{pmatrix}\mu_1 \\ \mu_2\end{pmatrix},\ \Sigma = \begin{pmatrix}\sigma_1^2 & \sigma(X_1, X_2) \\ \sigma(X_1, X_2) & \sigma_2^2\end{pmatrix}\right)$
$\mu_1$ and $\mu_2$ are the means of $X_1$ and $X_2$; $\sigma_1$ and $\sigma_2$ are the standard deviations of $X_1$ and $X_2$; $\sigma(X_1, X_2)$ is the covariance between $X_1$ and $X_2$, i.e., $\sigma(X_1, X_2) = E[(X_1 - \mu_1)(X_2 - \mu_2)]$.
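For concreteness, a short sketch that estimates these parameters from 2-d data and evaluates the bivariate density; the data below is made up for illustration:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
data = rng.normal(size=(500, 2)) @ np.array([[1.0, 0.6], [0.0, 0.8]])  # correlated 2-d sample

mu_hat = data.mean(axis=0)               # (mu_1, mu_2)
sigma_hat = np.cov(data, rowvar=False)   # 2x2 covariance matrix: variances on the diagonal,
                                         # sigma(X_1, X_2) off the diagonal

# Density of an arbitrary point under the estimated bivariate Gaussian:
print(multivariate_normal.pdf([0.5, -0.2], mean=mu_hat, cov=sigma_hat))
```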

Apply the EM Algorithm: 2-d
An iterative algorithm (at iteration $t+1$):
E(expectation)-step: evaluate the weights $w_{ij}$ given the current $\mu_j, \Sigma_j, w_j$:
$w_{ij}^t = \frac{w_j^t\, p(x_i \mid \mu_j^t, \Sigma_j^t)}{\sum_{j'} w_{j'}^t\, p(x_i \mid \mu_{j'}^t, \Sigma_{j'}^t)}$
M(maximization)-step: evaluate the $\mu_j, \Sigma_j, w_j$ that maximize the weighted likelihood given the $w_{ij}$'s (again equivalent to weighted Gaussian parameter estimation):
$\mu_j^{t+1} = \frac{\sum_i w_{ij}^t x_i}{\sum_i w_{ij}^t}; \quad (\sigma_{j,1}^2)^{t+1} = \frac{\sum_i w_{ij}^t (x_{i,1} - \mu_{j,1}^{t+1})^2}{\sum_i w_{ij}^t}; \quad (\sigma_{j,2}^2)^{t+1} = \frac{\sum_i w_{ij}^t (x_{i,2} - \mu_{j,2}^{t+1})^2}{\sum_i w_{ij}^t};$
$(\sigma(X_1, X_2)_j)^{t+1} = \frac{\sum_i w_{ij}^t (x_{i,1} - \mu_{j,1}^{t+1})(x_{i,2} - \mu_{j,2}^{t+1})}{\sum_i w_{ij}^t}; \quad w_j^{t+1} \propto \sum_i w_{ij}^t$
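In practice, these multivariate EM updates are available in standard libraries. A quick sketch using scikit-learn's GaussianMixture (the three well-separated components below are an arbitrary illustration, not data from the slides):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([-2, 0], 0.5, size=(100, 2)),
               rng.normal([2, 1], 0.8, size=(100, 2)),
               rng.normal([0, 4], 0.6, size=(100, 2))])

gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=0).fit(X)
resp = gmm.predict_proba(X)      # soft assignments w_ij
labels = gmm.predict(X)          # hard assignments (argmax of w_ij)
print(gmm.weights_, gmm.means_)  # learned w_j and mu_j; gmm.covariances_ holds Sigma_j
```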

K-Means: A Special Case of the Gaussian Mixture Model
When each Gaussian component has covariance matrix $\sigma^2 I$, we obtain soft K-means:
$p(x_i \mid \mu_j, \sigma^2) \propto \exp\{-\|x_i - \mu_j\|^2 / (2\sigma^2)\}$
When $\sigma^2 \to 0$, the soft assignment becomes a hard assignment: $w_{ij} \to 1$ if $x_i$ is closest to $\mu_j$ (why? in the limit, the assignment is driven purely by distance).
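A tiny numerical check of this limit: as $\sigma^2$ shrinks, responsibilities computed from equal-weight spherical Gaussians approach a one-hot (hard) assignment. The point and means below are made-up values for illustration.

```python
import numpy as np

x = np.array([0.6, 0.2])                   # one data point
mus = np.array([[0.0, 0.0], [2.0, 0.0]])   # two component means; x is closer to the first

for sigma2 in [10.0, 1.0, 0.01]:
    # log of exp(-||x - mu_j||^2 / (2 sigma^2)) for each component, equal weights
    logits = -((x - mus) ** 2).sum(axis=1) / (2 * sigma2)
    resp = np.exp(logits - logits.max())
    resp /= resp.sum()                     # softmax = posterior w_ij
    print(sigma2, resp)                    # tends to [1, 0] as sigma^2 -> 0
```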

*Why Does EM Work?
E-step: compute a tight lower bound $f$ of the original objective function $\ell$ at $\theta^{old}$.
M-step: find $\theta^{new}$ that maximizes the lower bound.
Then $\ell(\theta^{new}) \ge f(\theta^{new}) \ge f(\theta^{old}) = \ell(\theta^{old})$, so the likelihood never decreases.

*How to Find a Tight Lower Bound?
Let $q(h)$ be a distribution over the hidden variables; it defines the lower bound we want:
$\ell(\theta) = \log p(d \mid \theta) = \log \sum_h q(h)\, \frac{p(d, h \mid \theta)}{q(h)} \ge \sum_h q(h) \log \frac{p(d, h \mid \theta)}{q(h)}$ (Jensen's inequality).
When does "=" hold, giving a tight lower bound? When $q(h) = p(h \mid d, \theta)$ (why? the ratio inside the log is then constant in $h$).

Advantages and Disadvantages of GMM
Strengths:
- Mixture models are more general than partitioning methods: clusters can have different densities and sizes.
- Clusters can be characterized by a small number of parameters.
- The results may satisfy the statistical assumptions of the generative model.
Weaknesses:
- Converges only to a local optimum (overcome by running multiple times with random initialization).
- Computationally expensive if the number of distributions is large or the data set contains very few observed data points.
- Hard to estimate the number of clusters.
- With spherical covariances (as used above), it can only model spherical clusters.

Next: Kernel K-Means

*Kernel K-Means
How can we cluster the following data (figure)? Use a non-linear map $\varphi: R^n \to F$ that maps each data point into a higher- (possibly infinite-) dimensional feature space: $x \mapsto \varphi(x)$. The dot-product (kernel) matrix $K$ has entries $K_{ij} = \langle \varphi(x_i), \varphi(x_j) \rangle$.

Typical Kernel Functions (recall kernel SVM): for example, the polynomial kernel $K(x, y) = (x \cdot y + c)^d$ and the Gaussian (RBF) kernel $K(x, y) = \exp\{-\|x - y\|^2 / (2\sigma^2)\}$.

Solution of Kernel K-Means
Objective function in the new feature space:
$J = \sum_{j=1}^{k} \sum_i w_{ij}\, \|\varphi(x_i) - c_j\|^2$
Algorithm: by fixing the assignment $w_{ij}$, the centers are $c_j = \sum_i w_{ij} \varphi(x_i) / \sum_i w_{ij}$.
In the assignment step, assign each data point to the closest center:
$d(x_i, c_j) = \left\|\varphi(x_i) - \frac{\sum_{i'} w_{i'j}\, \varphi(x_{i'})}{\sum_{i'} w_{i'j}}\right\|^2 = \varphi(x_i) \cdot \varphi(x_i) - 2\,\frac{\sum_{i'} w_{i'j}\, \varphi(x_i) \cdot \varphi(x_{i'})}{\sum_{i'} w_{i'j}} + \frac{\sum_{i'} \sum_l w_{i'j} w_{lj}\, \varphi(x_{i'}) \cdot \varphi(x_l)}{(\sum_{i'} w_{i'j})^2}$
$= K_{ii} - 2\,\frac{\sum_{i'} w_{i'j} K_{ii'}}{\sum_{i'} w_{i'j}} + \frac{\sum_{i'} \sum_l w_{i'j} w_{lj} K_{i'l}}{(\sum_{i'} w_{i'j})^2}$
So we never need $\varphi(x)$ explicitly, only the kernel matrix $K_{ij}$.
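A compact NumPy sketch of this kernel-trick distance, using an RBF kernel as one possible choice; the initialization, kernel, and bandwidth are illustrative assumptions, not prescribed by the slides:

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """K_ij = exp(-gamma * ||x_i - x_j||^2)."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq)

def kernel_kmeans(K, k, n_iters=50, seed=0):
    """Kernel K-means using only the kernel matrix K (no explicit phi(x))."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    labels = rng.integers(k, size=n)   # random initial assignment
    for _ in range(n_iters):
        dist = np.zeros((n, k))
        for j in range(k):
            mask = labels == j
            nj = max(mask.sum(), 1)
            # d(x_i, c_j) = K_ii - 2 * sum_{i' in j} K_{i i'} / n_j
            #             + sum_{i', l in j} K_{i' l} / n_j^2
            dist[:, j] = (np.diag(K)
                          - 2 * K[:, mask].sum(axis=1) / nj
                          + K[np.ix_(mask, mask)].sum() / nj ** 2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels
```

For example, `labels = kernel_kmeans(rbf_kernel(X, gamma=2.0), k=2)` can separate non-linearly shaped clusters that plain K-means cannot.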

Advantages and Disadvantages of Kernel K-Means
Advantages:
- The algorithm can identify non-linear cluster structures.
Disadvantages:
- The number of clusters needs to be predefined.
- The algorithm is more complex, and its time complexity is large (it operates on the full n x n kernel matrix).
References:
- "Kernel k-means and Spectral Clustering" by Max Welling.
- "Kernel k-means, Spectral Clustering and Normalized Cuts" by Inderjit S. Dhillon, Yuqiang Guan, and Brian Kulis.
- "An Introduction to Kernel Methods" by Colin Campbell.


Summary
- Revisit k-means: objective function and derivation of the solution.
- Mixture models: Gaussian mixture model; multinomial mixture model; EM algorithm; connection to k-means.
- Kernel k-means*: objective function; solution; connection to k-means.