Machine Learning for Data Science (CS4786) Lecture 8


Machine Learning for Data Science (CS4786) Lecture 8: Clustering. Course webpage: http://www.cs.cornell.edu/courses/cs4786/2016fa/

Announcement: Those of you who submitted HW1 and are still on the waitlist, please email me.

CLUSTERING [Figure: the data as an n × d matrix X (n points, each with d features).]

CLUSTERING Grouping data points such that points in the same group are similar and points in different groups are dissimilar. A form of unsupervised classification: there are no predefined labels.

SOME NOTATIONS A K-ary clustering is a partition of x_1, ..., x_n into K groups; for now, assume the magical K is given to us. A clustering is given by C_1, ..., C_K, the partition of the data points. Given a clustering, we shall use c(x_t) to denote the cluster identity of point x_t according to that clustering. Let n_j denote |C_j|; clearly \sum_{j=1}^{K} n_j = n.
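As a concrete illustration of this notation, here is a minimal sketch (the variable names X, c, clusters, and sizes are mine, not the lecture's) showing the assignment view c(x_t) and the partition view C_1, ..., C_K side by side:

```python
import numpy as np

# Toy data: n = 6 points in d = 2 dimensions, with a K = 3 clustering
# given as an assignment vector c, where c[t] plays the role of c(x_t).
X = np.array([[0.0, 0.0], [0.1, 0.2],
              [5.0, 5.1], [5.2, 4.9],
              [9.0, 0.1], [8.8, 0.0]])
c = np.array([0, 0, 1, 1, 2, 2])
K = 3

# The partition view: C_j is the set of indices of points assigned to cluster j.
clusters = [np.where(c == j)[0] for j in range(K)]
sizes = [len(C_j) for C_j in clusters]      # n_j = |C_j|

print(clusters)                             # [array([0, 1]), array([2, 3]), array([4, 5])]
assert sum(sizes) == len(X)                 # sum_j n_j = n
```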

How do we formalize a good clustering objective?

How do we formalize? Say dissimilarity(x_t, x_s) measures the dissimilarity between x_t and x_s. Given two clusterings {C_1, ..., C_K} (or c) and {C'_1, ..., C'_K} (or c'), how do we decide which is better? We want that points in the same cluster are not dissimilar and points in different clusters are dissimilar.

CLUSTERING CRITERION
Minimize total within-cluster dissimilarity:
M_1 = \sum_{j=1}^{K} \sum_{s,t \in C_j} \text{dissimilarity}(x_t, x_s)
Maximize between-cluster dissimilarity:
M_2 = \sum_{x_s, x_t : c(x_s) \ne c(x_t)} \text{dissimilarity}(x_t, x_s)
Maximize the smallest between-cluster dissimilarity:
M_3 = \min_{x_s, x_t : c(x_s) \ne c(x_t)} \text{dissimilarity}(x_t, x_s)
Minimize the largest within-cluster dissimilarity:
M_4 = \max_{j \in [K]} \max_{s,t \in C_j} \text{dissimilarity}(x_t, x_s)
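A minimal sketch of these four criteria in code (the helper name clustering_criteria and the inputs D and c are my own choices, not prescribed by the lecture), assuming a precomputed pairwise dissimilarity matrix D and an assignment vector c:

```python
import numpy as np

def clustering_criteria(D, c):
    """Compute M1-M4 from an n x n pairwise dissimilarity matrix D and
    an assignment vector c, where c[t] is the cluster identity of x_t."""
    n = len(c)
    same = (c[:, None] == c[None, :])        # same[s, t] is True iff c(x_s) == c(x_t)
    off_diag = ~np.eye(n, dtype=bool)        # drop the trivial (t, t) pairs

    M1 = D[same & off_diag].sum()            # total within-cluster dissimilarity
    M2 = D[~same].sum()                      # total between-cluster dissimilarity
    M3 = D[~same].min()                      # smallest between-cluster dissimilarity
    M4 = D[same & off_diag].max()            # largest within-cluster dissimilarity
    return M1, M2, M3, M4

# Example with squared Euclidean distance as the dissimilarity.
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9]])
c = np.array([0, 0, 1, 1])
D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
print(clustering_criteria(D, c))
```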

CLUSTERING CRITERION
Minimize average within-cluster dissimilarity:
M_6 = \sum_{j=1}^{K} \frac{1}{|C_j|} \sum_{s \in C_j} \text{dissimilarity}(x_s, C_j)
    = \sum_{j=1}^{K} \frac{1}{|C_j|} \sum_{s \in C_j} \sum_{t \in C_j, t \ne s} \text{dissimilarity}(x_s, x_t)
    = \sum_{j=1}^{K} \frac{1}{|C_j|} \sum_{s \in C_j} \sum_{t \in C_j, t \ne s} \|x_s - x_t\|_2^2   (for squared Euclidean dissimilarity)
Minimize within-cluster variance, where r_j = \frac{1}{n_j} \sum_{x \in C_j} x is the centroid of cluster j:
M_5 = \sum_{j=1}^{K} \sum_{t \in C_j} \|x_t - r_j\|_2^2
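To make M_5 and M_6 concrete, here is a minimal sketch (my own function names, not code from the lecture) that computes both for squared Euclidean dissimilarity; numerically one can check that M_6 = 2 M_5, which is why minimizing the two coincides, as noted shortly:

```python
import numpy as np

def M5_within_cluster_variance(X, c, K):
    """M5: sum over clusters of squared distances to the cluster centroid r_j."""
    total = 0.0
    for j in range(K):
        C_j = X[c == j]
        r_j = C_j.mean(axis=0)                        # r_j = (1/n_j) sum_{x in C_j} x
        total += ((C_j - r_j) ** 2).sum()
    return total

def M6_average_within_dissimilarity(X, c, K):
    """M6 with squared Euclidean dissimilarity: (1/|C_j|) * sum over pairs in C_j."""
    total = 0.0
    for j in range(K):
        C_j = X[c == j]
        D = ((C_j[:, None, :] - C_j[None, :, :]) ** 2).sum(axis=-1)
        total += D.sum() / len(C_j)                    # diagonal terms are zero anyway
    return total

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
c = rng.integers(0, 3, size=30)
print(np.isclose(M6_average_within_dissimilarity(X, c, 3),
                 2 * M5_within_cluster_variance(X, c, 3)))   # True: M6 = 2 * M5
```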

How different are these criteria?

CLUSTERING CRITERION Some of these criteria coincide: minimizing M_1 ≡ maximizing M_2, and minimizing M_5 ≡ minimizing M_6 (when the dissimilarity is squared Euclidean distance). A short derivation is sketched below.
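The derivation behind these two equivalences, written out in LaTeX (my own write-up in the notation above; the second identity is the standard decomposition of pairwise squared distances around the centroid):

```latex
% Minimizing M_1 is the same as maximizing M_2: the total pairwise
% dissimilarity does not depend on the clustering, and it splits as
\sum_{s,t} \text{dissimilarity}(x_t, x_s)
  = \underbrace{\sum_{j=1}^{K}\sum_{s,t \in C_j} \text{dissimilarity}(x_t, x_s)}_{M_1}
  + \underbrace{\sum_{s,t \,:\, c(x_s) \neq c(x_t)} \text{dissimilarity}(x_t, x_s)}_{M_2},
% so M_1 = \text{const} - M_2.
%
% Minimizing M_5 is the same as minimizing M_6 for squared Euclidean
% dissimilarity: for each cluster C_j with centroid r_j,
\frac{1}{|C_j|}\sum_{s \in C_j}\sum_{t \in C_j}\|x_s - x_t\|_2^2
  = 2\sum_{t \in C_j}\|x_t - r_j\|_2^2,
% and summing over j gives M_6 = 2 M_5.
```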

CLUSTERING There are multiple clustering criteria, all equally valid. Different criteria lead to different algorithms and different solutions. Which notion of distance or cost we use matters.

Let's build algorithms for two criteria:
M_5 = \sum_{j=1}^{K} \sum_{t \in C_j} \|x_t - r_j\|_2^2
M_3 = \min_{x_s, x_t : c(x_s) \ne c(x_t)} \text{dissimilarity}(x_t, x_s)

Let's build an algorithm for
M_5 = \sum_{j=1}^{K} \sum_{t \in C_j} \|x_t - r_j\|_2^2,  where r_j = \frac{1}{|C_j|} \sum_{x_t \in C_j} x_t

K-MEANS CLUSTERING
For all j \in [K], initialize cluster centroids \hat{r}^1_j randomly and set m = 1.
Repeat until convergence (or until patience runs out):
1. For each t \in {1, ..., n}, set the cluster identity of the point: \hat{c}^m(x_t) = \arg\min_{j \in [K]} \|x_t - \hat{r}^m_j\|_2
2. For each j \in [K], set the new representative: \hat{r}^{m+1}_j = \frac{1}{|\hat{C}^m_j|} \sum_{x_t \in \hat{C}^m_j} x_t
3. m \leftarrow m + 1
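A minimal Python sketch of these steps (my own implementation, for illustration only; the initialization here picks K random data points as the initial centroids, which is one common choice the slide does not specify):

```python
import numpy as np

def k_means(X, K, max_iters=100, seed=0):
    """K-means iterations as on the slide: assign each point to its nearest
    centroid, then recompute each centroid as the mean of its cluster."""
    rng = np.random.default_rng(seed)
    r = X[rng.choice(len(X), size=K, replace=False)]     # initial centroids \hat{r}^1_j
    c = None
    for m in range(max_iters):
        # Step 1: cluster identity of each point, c^m(x_t) = argmin_j ||x_t - r_j||.
        dists = ((X[:, None, :] - r[None, :, :]) ** 2).sum(axis=-1)
        new_c = dists.argmin(axis=1)
        if c is not None and np.array_equal(new_c, c):   # converged: assignments unchanged
            break
        c = new_c
        # Step 2: new representative r^{m+1}_j = mean of the points assigned to cluster j.
        for j in range(K):
            if np.any(c == j):                            # keep old centroid if cluster is empty
                r[j] = X[c == j].mean(axis=0)
    return c, r

# Usage: three well-separated blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc, 0.3, size=(50, 2)) for loc in ([0, 0], [5, 5], [9, 0])])
c, r = k_means(X, K=3)
```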

K-MEANS CONVERGENCE
The K-means algorithm converges to a local minimum of the objective
O(c; r_1, ..., r_K) = \sum_{j=1}^{K} \sum_{t : c(x_t) = j} \|x_t - r_j\|_2^2
Proof: The cluster-assignment step improves the objective:
O(\hat{c}^{m-1}; \hat{r}^m_1, ..., \hat{r}^m_K) \ge O(\hat{c}^m; \hat{r}^m_1, ..., \hat{r}^m_K)   (by the definition of \hat{c}^m(x_t) as the nearest centroid)
Recomputing centroids improves the objective:
O(\hat{c}^m; \hat{r}^m_1, ..., \hat{r}^m_K) \ge O(\hat{c}^m; \hat{r}^{m+1}_1, ..., \hat{r}^{m+1}_K)   (by the fact that the centroid minimizes the sum of squared distances to a cluster's points)
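This monotonicity argument can also be checked numerically. The standalone sketch below (my own code, not from the lecture) records the objective after every assignment step and every centroid step and verifies it never increases:

```python
import numpy as np

def objective(X, c, r):
    """O(c; r_1, ..., r_K) = sum_j sum_{t: c(x_t)=j} ||x_t - r_j||^2."""
    return sum(((X[c == j] - r[j]) ** 2).sum() for j in range(len(r)))

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
K = 4
r = X[rng.choice(len(X), size=K, replace=False)].copy()

values = []
for m in range(20):
    # Assignment step: each point moves to its nearest centroid, so O cannot increase.
    c = ((X[:, None, :] - r[None, :, :]) ** 2).sum(axis=-1).argmin(axis=1)
    values.append(objective(X, c, r))
    # Centroid step: the mean minimizes the sum of squared distances, so O cannot increase.
    r = np.array([X[c == j].mean(axis=0) if np.any(c == j) else r[j] for j in range(K)])
    values.append(objective(X, c, r))

assert all(a >= b - 1e-9 for a, b in zip(values, values[1:]))   # non-increasing
```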

Let's build an algorithm for
M_3 = \min_{x_s, x_t : c(x_s) \ne c(x_t)} \text{dissimilarity}(x_t, x_s)

dissimilarity(C_i, C_j) = \min_{t \in C_i, s \in C_j} \text{dissimilarity}(x_t, x_s)


SINGLE LINK CLUSTERING
Initialize n clusters, with each point x_t in its own cluster.
Until there are only K clusters, do:
1. Find the closest two clusters and merge them into one cluster.
2. Update the between-cluster distances (the proximity matrix), using
dissimilarity(C_i, C_j) = \min_{t \in C_i, s \in C_j} \text{dissimilarity}(x_t, x_s)
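A naive Python sketch of this agglomerative procedure (my own implementation; it uses squared Euclidean distance as the point-level dissimilarity and recomputes cluster distances directly rather than maintaining an explicit proximity matrix):

```python
import numpy as np

def single_link(X, K):
    """Start with n singleton clusters and repeatedly merge the closest pair
    under the single-link dissimilarity until only K clusters remain."""
    D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)   # pairwise dissimilarities
    clusters = [[t] for t in range(len(X))]                   # each point in its own cluster

    while len(clusters) > K:
        # Closest pair of clusters: dissimilarity(C_i, C_j) = min over t in C_i, s in C_j.
        best = (np.inf, None, None)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = D[np.ix_(clusters[i], clusters[j])].min()
                if d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]                # merge the two closest clusters
        del clusters[j]

    # Return an assignment vector c(x_t) for convenience.
    c = np.empty(len(X), dtype=int)
    for label, members in enumerate(clusters):
        c[members] = label
    return c

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(loc, 0.2, size=(20, 2)) for loc in ([0, 0], [4, 4], [8, 0])])
print(np.bincount(single_link(X, K=3)))   # cluster sizes; [20, 20, 20] for these blobs
```

For larger datasets, scipy.cluster.hierarchy.linkage(X, method='single') computes the same single linkage far more efficiently.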

SINGLE LINK OBJECTIVE
Objective for single-link:
M_3 = \min_{x_s, x_t : c(x_s) \ne c(x_t)} \text{dissimilarity}(x_t, x_s)   (here with dissimilarity(x_t, x_s) = \|x_s - x_t\|_2^2)
Single link clustering is optimal for the above objective!