Data Preprocessing. Cluster Similarity


Cluster Similarity. Similarity is most often measured with the help of a distance function: the smaller the distance, the more similar the data objects (points). A function $d: M \times M \to \mathbb{R}$ is a distance on $M$ if it satisfies, for all $x, y, z \in M$ (where $M$ is an arbitrary nonempty set and $\mathbb{R}$ is the set of real numbers):
(i) $d(x, y) = 0 \Leftrightarrow x = y$ (coincidence axiom)
(ii) $d(x, y) = d(y, x)$ (symmetry)
(iii) $d(x, z) \le d(x, y) + d(y, z)$ (triangle inequality)
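As a quick illustration (not part of the original slides), the following minimal Python sketch checks the symmetry and triangle-inequality axioms numerically for the Euclidean distance on a handful of random points; all names here are illustrative.

```python
# A minimal sketch: numerically checking two of the metric axioms
# for the Euclidean distance on a few random points.
import numpy as np

rng = np.random.default_rng(0)

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

points = rng.normal(size=(5, 3))          # 5 random points in R^3
for x in points:
    for y in points:
        assert np.isclose(euclidean(x, y), euclidean(y, x))              # symmetry
        for z in points:
            # triangle inequality: d(x, z) <= d(x, y) + d(y, z)
            assert euclidean(x, z) <= euclidean(x, y) + euclidean(y, z) + 1e-12
print("Symmetry and the triangle inequality hold on this sample.")
```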

Distance Functions: Minkowski. The Minkowski family of distance functions is defined as:
$$d_k(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^k \right)^{1/k}$$
where $n$ is the number of features and $k$ is a parameter.
$k = 1$: Manhattan distance (or city block distance)
$k = 2$: Euclidean distance
$k \to \infty$: maximum distance (i.e., $d_\infty(x, y) = \max_{i=1}^{n} |x_i - y_i|$)
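A short sketch of this family in Python (the function name and sample vectors are illustrative, not from the slides):

```python
# Minkowski distances: Manhattan (k=1), Euclidean (k=2),
# and the limiting maximum (Chebyshev) distance.
import numpy as np

def minkowski(x, y, k):
    return np.sum(np.abs(x - y) ** k) ** (1.0 / k)

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

print(minkowski(x, y, 1))        # Manhattan: |1-4| + |2-0| + |3-3| = 5
print(minkowski(x, y, 2))        # Euclidean: sqrt(9 + 4 + 0) ~ 3.606
print(np.max(np.abs(x - y)))     # k -> infinity: max |x_i - y_i| = 3
```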

Distance Functions: Mahalanobis. The Mahalanobis distance is defined as:
$$d(x, y) = (x - y)^T \Sigma^{-1} (x - y)$$
where $\Sigma$ is the covariance matrix of the data. (Figure: the same data clustered under Euclidean distance and under Mahalanobis distance, with the cluster means marked.)
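A sketch of this definition in Python, with the covariance matrix estimated from a simulated data sample (an assumption; the slides do not specify how $\Sigma$ is obtained):

```python
# Mahalanobis distance (x - y)^T Sigma^{-1} (x - y), as defined above.
import numpy as np

rng = np.random.default_rng(1)
data = rng.multivariate_normal(mean=[0, 0], cov=[[3.0, 2.0], [2.0, 2.0]], size=500)

sigma_inv = np.linalg.inv(np.cov(data, rowvar=False))   # inverse covariance of the data

def mahalanobis(x, y, sigma_inv):
    d = x - y
    return float(d @ sigma_inv @ d)

x, y = data[0], data[1]
print(mahalanobis(x, y, sigma_inv))
```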

Exclusive vs. Overlapping vs. Fuzzy. An exclusive clustering assigns each object to a single cluster. An overlapping (non-exclusive) clustering reflects the fact that an object can simultaneously belong to more than one group (class). A fuzzy clustering assigns objects to clusters with a membership weight between 0 (absolutely doesn't belong) and 1 (absolutely belongs).

Clustering Using Mixture Models. Often, it is convenient and effective to assume that the data has been generated by a statistical process. To describe the data, we can find the statistical model (described by a distribution and a set of parameters for that distribution) that best fits the data. Mixture models use statistical distributions to model the data: each distribution corresponds to a cluster, and the parameters of each distribution describe the corresponding cluster.

Mixture Models. Mixture models view the data as a set of observations from a mixture of different probability distributions. The idea: given several distributions, usually of the same type but with different parameters, randomly select one of these distributions and generate an object from it. Repeat the process m times, where m is the number of objects.
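A sketch of this generative process in Python; the weights, means, and standard deviations below are made-up illustrative values, not parameters from the slides:

```python
# Generative process: pick one of K Gaussians at random according to its
# weight, then draw an object from it; repeat m times.
import numpy as np

rng = np.random.default_rng(2)

weights = np.array([0.5, 0.5])     # probability of choosing each distribution
means   = np.array([-4.0, 4.0])
stds    = np.array([2.0, 2.0])

m = 1000                                                     # number of objects
components = rng.choice(len(weights), size=m, p=weights)     # which distribution generated each object
X = rng.normal(loc=means[components], scale=stds[components])

print(X[:5], components[:5])
```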

The EM Algorithm.
1: Select an initial set of model parameters.
2: repeat
3:   Expectation Step: for each object, calculate the probability that it belongs to each distribution, i.e., calculate $prob(\text{distribution } j \mid x_i, \Theta)$.
4:   Maximization Step: given the probabilities from the expectation step, find the new estimates of the parameters that maximize the expected likelihood.
5: until the parameters do not change (or the change is below a threshold).
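A minimal Python sketch of this loop for a mixture of two one-dimensional Gaussians with equal, fixed weights and a fixed standard deviation (assumptions matching the worked example later in the slides; a full implementation would also update the weights and variances):

```python
import numpy as np
from scipy.stats import norm

def em_two_gaussians(X, mu, sigma=2.0, tol=1e-6, max_iter=200):
    mu = np.asarray(mu, dtype=float)                     # initial guesses for mu_1, mu_2
    for _ in range(max_iter):
        # Expectation step: prob(distribution j | x_i, Theta) via Bayes rule
        dens = np.vstack([norm.pdf(X, loc=m, scale=sigma) for m in mu])  # shape (2, m)
        resp = 0.5 * dens / (0.5 * dens).sum(axis=0)     # membership probabilities
        # Maximization step: weighted averages of the points
        new_mu = (resp * X).sum(axis=1) / resp.sum(axis=1)
        if np.max(np.abs(new_mu - mu)) < tol:            # stop when the parameters stop changing
            return new_mu
        mu = new_mu
    return mu

# Example: recover the means of data drawn from N(-4, 2) and N(4, 2)
rng = np.random.default_rng(3)
X = np.concatenate([rng.normal(-4, 2, 500), rng.normal(4, 2, 500)])
print(em_two_gaussians(X, mu=[-2.0, 3.0]))
```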

The EM Algorithm: A Visual Description. Consider two initial distributions. Here, we use two Gaussian distributions, each with its own mean, μ, and standard deviation, σ. First, we make an initial guess for $\mu_1$ and $\mu_2$ (perhaps initializing $\sigma_1$ and $\sigma_2$ to a constant). (Figure: the two initial distributions and their means.)

The EM Algorithm: A Visual Description. For the expectation step, we compute the probability that a point came from a particular distribution. These values can be expressed by an application of Bayes' rule.

The EM Algorithm: A Visual Description. After computing the cluster membership probabilities for all the points, we compute new estimates for $\mu_1$ and $\mu_2$. This is the maximization step. Here, the new estimate for the mean of a distribution is just a weighted average of the points, where the weights are the probabilities that the points belong to the distribution.

The EM Algorithm: A Visual Description. We repeat these two steps until the estimates of $\mu_1$ and $\mu_2$ either don't change or change very little.

Mixture Models: A Mathy Description. Assume K distributions and m objects, $X = \{x_1, \ldots, x_m\}$. Let the j-th distribution have parameters $\theta_j$, and let $\Theta$ be the set of all parameters, i.e., $\Theta = \{\theta_1, \ldots, \theta_K\}$. The probability that the j-th distribution is chosen to generate an object is given by the weight $w_j$, $1 \le j \le K$, where these weights (probabilities) sum to one, i.e., $\sum_{j=1}^{K} w_j = 1$. Then the probability of an object $x_i$ is given by:
$$prob(x_i \mid \Theta) = \sum_{j=1}^{K} w_j \, p_j(x_i \mid \theta_j)$$
where $p_j(x_i \mid \theta_j)$ is the probability of object $x_i$ if it comes from the j-th distribution.

Mixture Models: A Mathy Description. If the objects are generated in an independent manner, then the probability of the entire set of objects is just the product of the probabilities of the individual $x_i$:
$$prob(X \mid \Theta) = \prod_{i=1}^{m} prob(x_i \mid \Theta) = \prod_{i=1}^{m} \sum_{j=1}^{K} w_j \, p_j(x_i \mid \theta_j)$$
Here $prob(x_i \mid \Theta)$ is the probability of the i-th object, and $p_j(x_i \mid \theta_j)$ is the probability of the i-th object if it comes from the j-th distribution.

Gaussian Mixture: A Mathy Description. Consider a mixture model in terms of Gaussian distributions. The probability density function for a one-dimensional Gaussian distribution at a point $x_i$ is:
$$prob(x_i \mid \theta) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x_i - \mu)^2}{2\sigma^2}}$$
The parameters of the Gaussian distribution are given by $\theta = (\mu, \sigma)$, where $\mu$ is the mean of the distribution and $\sigma$ is the standard deviation.
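For reference, a direct transcription of this density in Python (the helper name is illustrative); SciPy's `scipy.stats.norm.pdf` gives the same value:

```python
import numpy as np
from scipy.stats import norm

def gaussian_pdf(x, mu, sigma):
    # 1 / (sqrt(2*pi) * sigma) * exp(-(x - mu)^2 / (2 * sigma^2))
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

print(gaussian_pdf(1.0, mu=0.0, sigma=2.0))
print(norm.pdf(1.0, loc=0.0, scale=2.0))   # should agree
```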

MLE: A Mathy Description. Consider a set of m points that are generated from a one-dimensional Gaussian distribution:
$$prob(X \mid \Theta) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x_i - \mu)^2}{2\sigma^2}}$$
$$\log prob(X \mid \Theta) = -\sum_{i=1}^{m} \frac{(x_i - \mu)^2}{2\sigma^2} - 0.5\, m \log 2\pi - m \log \sigma$$
One approach to estimating $\mu$ and $\sigma$ is to choose the values of the parameters for which the data is most probable (most likely). This is known as the maximum likelihood principle, and the resulting estimates are maximum likelihood estimates (MLE).

MLE: A Mathy Description. The concept is called the maximum likelihood principle because, given a set of data, the probability of the data, regarded as a function of the parameters, is called a likelihood function:
$$likelihood(\Theta \mid X) = L(\Theta \mid X) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x_i - \mu)^2}{2\sigma^2}}$$
For practical reasons, the log likelihood is more commonly used. Note that the parameter values that maximize the log likelihood also maximize the likelihood:
$$\log likelihood(\Theta \mid X) = \ell(\Theta \mid X) = -\sum_{i=1}^{m} \frac{(x_i - \mu)^2}{2\sigma^2} - 0.5\, m \log 2\pi - m \log \sigma$$
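A short sketch of maximum likelihood estimation for the one-dimensional Gaussian on simulated data: the log likelihood above is maximized by the sample mean and the (biased) sample standard deviation, which we can check numerically. The data here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(loc=3.0, scale=2.0, size=2000)
m = len(X)

mu_mle = X.mean()
sigma_mle = np.sqrt(np.mean((X - mu_mle) ** 2))   # divides by m, not m - 1

def log_likelihood(mu, sigma):
    return (-np.sum((X - mu) ** 2) / (2 * sigma ** 2)
            - 0.5 * m * np.log(2 * np.pi) - m * np.log(sigma))

print(mu_mle, sigma_mle)
print(log_likelihood(mu_mle, sigma_mle))          # higher than nearby parameter values
print(log_likelihood(mu_mle + 0.5, sigma_mle))
```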

Example of EM: Initialization. Begin the EM algorithm by making initial guesses for $\mu_1$ and $\mu_2$, say, $\mu_1 = -2$ and $\mu_2 = 3$. Thus, the initial parameters, $\theta = (\mu, \sigma)$, for the two distributions are, respectively, $\theta_1 = (-2, 2)$ and $\theta_2 = (3, 2)$. The set of parameters for the entire mixture model is $\Theta = \{\theta_1, \theta_2\}$.

Example of EM: Expectation Step. For the expectation step of EM, we want to compute the probability that a point came from a particular distribution. These values can be expressed by a straightforward application of Bayes' rule:
$$prob(\text{distribution } j \mid x_i, \Theta) = \frac{0.5 \, prob(x_i \mid \theta_j)}{0.5 \, prob(x_i \mid \theta_1) + 0.5 \, prob(x_i \mid \theta_2)}$$
where 0.5 is the probability (weight) of each distribution and j is 1 or 2.

Example of EM: Maximization Step. After computing the cluster membership probabilities for all the points, we compute new estimates for $\mu_1$ and $\mu_2$ in the maximization step of the EM algorithm:
$$\mu_j = \frac{\sum_{i=1}^{m} x_i \, prob(\text{distribution } j \mid x_i, \Theta)}{\sum_{i=1}^{m} prob(\text{distribution } j \mid x_i, \Theta)}$$
where j is a particular distribution and m is the number of objects.
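To make the two formulas concrete, here is one expectation step and one maximization step in Python, using equal weights of 0.5 and the initial parameters $\theta_1 = (-2, 2)$ and $\theta_2 = (3, 2)$ from the example; the data points are illustrative, since the slides do not list the actual sample.

```python
import numpy as np
from scipy.stats import norm

X = np.array([-5.0, -3.0, -1.0, 2.0, 4.0, 6.0])     # illustrative points
mus, sigma = np.array([-2.0, 3.0]), 2.0

# Expectation step: prob(distribution j | x_i, Theta) via Bayes rule
dens = np.vstack([norm.pdf(X, loc=mu, scale=sigma) for mu in mus])
resp = 0.5 * dens / (0.5 * dens).sum(axis=0)

# Maximization step: new means as probability-weighted averages of the points
new_mus = (resp * X).sum(axis=1) / resp.sum(axis=1)
print(resp.round(3))
print(new_mus.round(3))
```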

Mixture Model Clustering Using EM. The EM algorithm can be slow, and it does not work well when clusters contain only a few data points or when the data points are nearly collinear. There is also the problem of estimating the number of clusters or, more generally, of choosing the exact form of the model to use. On the other hand, mixture models are more general than k-means because they can use distributions of various types, and it is easy to characterize the clusters produced.

Clustering Evaluation.
1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data.
2. Determining the correct number of clusters.
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information.
4. Comparing the results of a cluster analysis to externally known results, such as externally provided class labels.
5. Comparing two sets of clusters to determine which is better.

Types of Cluster Evaluation.
Unsupervised: measures the goodness of a clustering structure without respect to external information.
Supervised: measures the extent to which the clustering structure discovered by a clustering algorithm matches some external structure.
Relative: compares different clusterings or clusters with a supervised or unsupervised evaluation measure that is used for the purpose of comparison.

Overall Cluster Validity. In general, we can express the overall cluster validity for a set of K clusters as a weighted sum of the validity of the individual clusters:
$$\text{overall validity} = \sum_{i=1}^{K} w_i \, \text{validity}(C_i)$$
The validity function can be cohesion (compactness, tightness), separation (isolation), or some combination of the two.

Prototype-Based Validity. Cohesion can be defined as the sum of the proximities with respect to the prototype of the cluster. Separation can be measured by the proximity of the two cluster prototypes. (Figure: cohesion within a cluster and separation between cluster prototypes.)
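A sketch of prototype-based cohesion and separation, together with the weighted overall validity from the previous slide, using squared Euclidean distance as the proximity measure and relative cluster sizes as the weights (both are assumptions; the slides do not fix these choices). The simulated clusters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
clusters = [rng.normal(loc=c, scale=0.5, size=(30, 2)) for c in ([0, 0], [4, 4])]
prototypes = [C.mean(axis=0) for C in clusters]

def cohesion(C, prototype):
    # sum of squared proximities of the cluster's points to its prototype
    return float(np.sum((C - prototype) ** 2))

def separation(p1, p2):
    # squared proximity between the two cluster prototypes
    return float(np.sum((p1 - p2) ** 2))

cohesions = [cohesion(C, p) for C, p in zip(clusters, prototypes)]
weights = [len(C) / sum(len(C) for C in clusters) for C in clusters]
overall_validity = sum(w * v for w, v in zip(weights, cohesions))

print(cohesions, separation(prototypes[0], prototypes[1]), overall_validity)
```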

The Silhouette Coefficient. This method combines cohesion and separation:
1. For the i-th object, calculate its average distance to all other objects in its cluster; call this value $a_i$.
2. For the i-th object and any cluster not containing the object, calculate the object's average distance to all the objects in the given cluster. Find the minimum such value with respect to all clusters; call this value $b_i$.
3. For the i-th object, the silhouette coefficient is:
$$s_i = \frac{b_i - a_i}{\max(a_i, b_i)}$$

The Silhouette Coefficient. The silhouette coefficient combines both cohesion and separation. Its value can vary between -1 and 1. A negative value is undesirable, because it corresponds to a case in which $a_i$, the average distance to points in the object's own cluster, is greater than $b_i$, the minimum average distance to points in another cluster. We want the silhouette coefficient to be positive ($a_i < b_i$), and for $a_i$ to be as close to 0 as possible.
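A Python sketch that computes the silhouette coefficient exactly as in the three steps above on simulated, well-separated clusters (the data and function name are illustrative); scikit-learn's `silhouette_samples`, if available, computes the same quantity.

```python
import numpy as np

def silhouette(X, labels):
    X, labels = np.asarray(X), np.asarray(labels)
    s = np.zeros(len(X))
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)                 # distances from point i
        same = (labels == labels[i])
        a_i = d[same & (np.arange(len(X)) != i)].mean()      # avg distance within own cluster
        b_i = min(d[labels == c].mean() for c in set(labels) if c != labels[i])
        s[i] = (b_i - a_i) / max(a_i, b_i)
    return s

rng = np.random.default_rng(6)
X = np.vstack([rng.normal([0, 0], 0.5, (20, 2)), rng.normal([3, 3], 0.5, (20, 2))])
labels = np.array([0] * 20 + [1] * 20)
print(silhouette(X, labels).mean())     # close to 1 for well-separated clusters
```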

Hierarchical Cluster Evaluation. The cophenetic distance between two objects is the proximity at which an agglomerative hierarchical clustering technique puts the objects in the same cluster for the first time. For example, if at some point in the agglomerative hierarchical clustering process the smallest distance between the two clusters that are merged is 0.1, then all points in one cluster have a cophenetic distance of 0.1 with respect to the points in the other cluster.

Example of Cophenetic Distance (Single Link Clustering). Hierarchical tree diagram (dendrogram with leaves ordered 3, 6, 2, 5, 4, 1) and the corresponding cophenetic distance matrix:

     p1    p2    p3    p4    p5    p6
p1   0     0.22  0.22  0.22  0.22  0.22
p2   0.22  0     0.15  0.15  0.14  0.15
p3   0.22  0.15  0     0.15  0.15  0.11
p4   0.22  0.15  0.15  0     0.15  0.15
p5   0.22  0.14  0.15  0.15  0     0.15
p6   0.22  0.15  0.11  0.15  0.15  0
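A matrix of this kind can be computed with SciPy's `linkage` and `cophenet` functions. The sketch below uses randomly generated points for illustration, not the six points from the slide.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(7)
X = rng.normal(size=(6, 2))                  # six illustrative points

Z = linkage(pdist(X), method='single')       # single-link agglomerative clustering
coph_condensed = cophenet(Z)                 # cophenetic distances, condensed form
print(squareform(coph_condensed).round(2))   # full cophenetic distance matrix
```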

Summarization of Clustering. The greater the similarity (or homogeneity) within a group and the greater the difference between groups, the better or more distinct the clustering. Clusterings can be partitional (e.g., k-means) or hierarchical (e.g., agglomerative hierarchical clustering). Clusterings can be exclusive, overlapping, or fuzzy (e.g., mixture models). Cluster evaluation assesses the validity of clusters.

Summarization. Association analysis and cluster analysis are both types of unsupervised learning (the process of trying to find hidden structure in unlabeled data). Association analysis is the process of discovering interesting relations between variables in large datasets. Cluster analysis is the process of grouping data objects based only on the description of the objects and their relationships.

And now, let's see some cluster evaluation!