
Clustering
Léon Bottou, NEC Labs America
COS 424, 3/4/2010

Agenda

Goals: classification, clustering, regression, other.
Representation: parametric vs. kernels vs. nonparametric; probabilistic vs. nonprobabilistic; linear vs. nonlinear; deep vs. shallow.
Capacity control: explicit (architecture, feature selection; regularization, priors); implicit (approximate optimization; Bayesian averaging, ensembles).
Operational considerations: loss functions, budget constraints, online vs. offline.
Computational considerations: exact algorithms for small datasets, stochastic algorithms for big datasets, parallel algorithms.

Introduction

Clustering: assigning observations to subsets with similar characteristics.

Applications: medicine, biology, market research, data mining, image segmentation, search results, topics and taxonomies, communities.

Why is clustering so attractive? It is an embodiment of Descartes' philosophy (Discourse on the Method of Rightly Conducting One's Reason): "... divide each of the difficulties under examination ... as might be necessary for its adequate solution."

Summary

1. What is a cluster?
2. K-Means
3. Hierarchical clustering
4. Simple Gaussian mixtures

What is a cluster?

Two neatly separated classes leave a trace in the input distribution P{X}.

Input space transformations

The choice of input space is often arbitrary, for instance camera pixels versus retina pixels. What happens if we apply a reversible transformation to the inputs?

Input space transformations

The Bayes optimal decision boundary moves with the transformation. The Bayes optimal error rate is unchanged. But the neatly separated clusters are gone!

Clustering depends on the arbitrary definition of the input space! This is very different from classification, regression, etc.

K-Means

The K-Means problem: given observations x_1 ... x_n, determine K centroids w_1 ... w_K that minimize the distortion

    C(w) = \sum_{i=1}^{n} \min_k \| x_i - w_k \|^2 .

Interpretation: minimize the discretization error.

Properties:
- Non-convex objective.
- Finding the global minimum is NP-hard in general.
- Finding acceptable local minima is surprisingly easy.
- Initialization dependent.
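As a concrete reading of this objective, here is a minimal NumPy sketch of the distortion C(w); the function name and the (n, d) / (K, d) array layout are illustrative choices, not something from the slides.

```python
import numpy as np

def distortion(X, W):
    """C(w) = sum_i min_k ||x_i - w_k||^2 for data X of shape (n, d) and centroids W of shape (K, d)."""
    d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)  # squared distances, shape (n, K)
    return d2.min(axis=1).sum()
```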

Offline K-Means: Lloyd's algorithm

initialize centroids w_k
repeat
- assign points to clusters: \forall i,\ s_i \leftarrow \arg\min_k \| x_i - w_k \|^2, and let S_k = \{ i : s_i = k \}.
- recompute centroids: \forall k,\ w_k \leftarrow \arg\min_w \sum_{i \in S_k} \| x_i - w \|^2 = \frac{1}{\mathrm{card}(S_k)} \sum_{i \in S_k} x_i.
until convergence.
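Below is a minimal NumPy sketch of Lloyd's algorithm as stated above; the initialization from K random observations, the iteration cap, and the convergence test on centroid movement are all illustrative choices.

```python
import numpy as np

def lloyd_kmeans(X, K, n_iter=100, rng=None):
    """Lloyd's algorithm: alternate the assignment step and the centroid step."""
    rng = np.random.default_rng(rng)
    W = X[rng.choice(len(X), size=K, replace=False)].astype(float)  # K random points
    s = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Assignment step: nearest centroid for every point.
        d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
        s = d2.argmin(axis=1)
        # Centroid step: mean of the points assigned to each cluster.
        W_new = np.array([X[s == k].mean(axis=0) if np.any(s == k) else W[k]
                          for k in range(K)])
        if np.allclose(W_new, W):
            break
        W = W_new
    return W, s
```

Each iteration performs exactly the two steps of the slide, so the distortion of the previous function never increases from one iteration to the next.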

Lloyd's algorithm illustration. Initial state: squares = data points, circles = centroids.

Lloyd's algorithm illustration, step 1: assign data points to clusters.

Lloyd's algorithm illustration, step 2: recompute centroids.

Lloyd's algorithm illustration: assign data points to clusters again, and so on.

Why does Lloyd's algorithm work?

Consider an arbitrary cluster assignment s_i and write

    C(w) = \sum_{i=1}^{n} \min_k \| x_i - w_k \|^2
         = \underbrace{\sum_{i=1}^{n} \| x_i - w_{s_i} \|^2}_{L(s,w)}
         - \underbrace{\sum_{i=1}^{n} \Big( \| x_i - w_{s_i} \|^2 - \min_k \| x_i - w_k \|^2 \Big)}_{D(s,w) \ge 0} .

[Figure: alternating steps.] The assignment step drives D(s,w) to zero and the centroid step decreases L(s,w); since C(w) \le L(s,w), each iteration can only decrease the distortion.

Online K-Means: MacQueen's algorithm

initialize centroids w_k and counts n_k = 0.
repeat
- pick an observation x_t and determine its cluster s_t = \arg\min_k \| x_t - w_k \|^2.
- update centroid s_t: n_{s_t} \leftarrow n_{s_t} + 1, then w_{s_t} \leftarrow w_{s_t} + \frac{1}{n_{s_t}} ( x_t - w_{s_t} ).
until satisfaction.

Comments: MacQueen's algorithm finds decent clusters much faster. Final convergence could be slow, but do we really care? Just perform one or two passes over the randomly shuffled observations.
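A minimal NumPy sketch of MacQueen's online update, assuming the same data layout as before; the two-pass default follows the slide's advice of one or two passes over shuffled observations.

```python
import numpy as np

def macqueen_kmeans(X, K, n_passes=2, rng=None):
    """Online K-Means (MacQueen): one centroid update per observation."""
    rng = np.random.default_rng(rng)
    W = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    n = np.zeros(K)                           # number of points seen per cluster
    for _ in range(n_passes):
        for t in rng.permutation(len(X)):     # randomly shuffled pass
            s = ((X[t] - W) ** 2).sum(axis=1).argmin()
            n[s] += 1
            W[s] += (X[t] - W[s]) / n[s]      # running-average update
    return W
```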

Why does MacQueen's algorithm work?

Explanation 1: Recursive averages. Let u_n = \frac{1}{n} \sum_{i=1}^{n} x_i. Then u_n = u_{n-1} + \frac{1}{n} ( x_n - u_{n-1} ).

Explanation 2: Stochastic gradient. Apply stochastic gradient descent to C(w) = \frac{1}{2n} \sum_{i=1}^{n} \min_k \| x_i - w_k \|^2, which gives the update w_{s_t} \leftarrow w_{s_t} + \gamma_t ( x_t - w_{s_t} ).

Explanation 3: Stochastic gradient + Newton. The Hessian H of C(w) is diagonal and contains the fraction of observations assigned to each cluster, so w_{s_t} \leftarrow w_{s_t} + \frac{1}{t} H^{-1} ( x_t - w_{s_t} ) = w_{s_t} + \frac{1}{n_{s_t}} ( x_t - w_{s_t} ).
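A quick numeric check of Explanation 1, the recursive-average identity; this snippet is purely illustrative and not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)

u = 0.0
for n, xn in enumerate(x, start=1):
    u += (xn - u) / n            # u_n = u_{n-1} + (x_n - u_{n-1}) / n

print(np.isclose(u, x.mean()))   # True: the running average equals the batch mean
```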

Example: color quantization of images

Problem: convert a 24-bit RGB image into an indexed image with a palette of K colors.

Solution:
- The (r, g, b) values of the pixels are the observations x_i.
- The (r, g, b) values of the K palette colors are the centroids w_k.
- Initialize the w_k with an all-purpose palette, or alternatively with the colors of random pixels.
- Perform one pass of MacQueen's algorithm.
- Eliminate centroids with no observations. You are done.
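A sketch of this recipe, re-using the macqueen_kmeans sketch from above; loading a real image is left out, and the function name and return convention are illustrative.

```python
import numpy as np

def quantize_colors(image, K=16, rng=None):
    """Map each pixel of an (H, W, 3) RGB array to its nearest palette color."""
    pixels = image.reshape(-1, 3).astype(float)
    palette = macqueen_kmeans(pixels, K, n_passes=1, rng=rng)   # one online pass
    idx = ((pixels[:, None, :] - palette[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    # Eliminate palette entries (centroids) with no observations.
    used = np.unique(idx)
    palette, idx = palette[used], np.searchsorted(used, idx)
    return palette, idx.reshape(image.shape[:2])
```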

How many clusters?

Rules of thumb? K = 10, K = \sqrt{n}, ...

The elbow method? Measure the distortion on a validation set. The distortion decreases when K increases. Sometimes there is no elbow, or several elbows; local minima mess up the picture.

Rate-distortion: each additional cluster reduces the distortion. Trade the cost of an additional cluster against the cost of the distortion. Just another way to select K.

Conclusion: clustering is a very subjective matter.
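A sketch of the elbow-method measurement described above, re-using the lloyd_kmeans sketch; averaging the held-out distortion per point is an illustrative convention.

```python
import numpy as np

def validation_distortion(X_train, X_val, ks, rng=0):
    """Held-out distortion for several values of K (data for an elbow plot)."""
    curve = {}
    for K in ks:
        W, _ = lloyd_kmeans(X_train, K, rng=rng)           # Lloyd sketch from above
        d2 = ((X_val[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
        curve[K] = d2.min(axis=1).mean()                   # average held-out distortion
    return curve
```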

Hierarchical clustering

Agglomerative clustering:
- Initialization: each observation is its own cluster.
- Repeatedly merge the closest clusters, using for instance
  single linkage D(A, B) = \min_{x \in A,\, y \in B} d(x, y),
  complete linkage D(A, B) = \max_{x \in A,\, y \in B} d(x, y),
  distortion estimates, etc.

Divisive clustering:
- Initialization: one cluster contains all observations.
- Repeatedly divide the largest cluster, e.g. with 2-Means.

Lots of variants.
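In practice one rarely codes agglomerative clustering from scratch; here is a minimal sketch using SciPy's hierarchy module (SciPy and the three-blob toy data are assumptions for illustration, not something the slides rely on).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.3, size=(30, 2)) for loc in ((0, 0), (3, 0), (0, 3))])

Z = linkage(X, method='single')                    # single linkage: min pairwise distance
labels = fcluster(Z, t=3, criterion='maxclust')    # cut the tree into 3 clusters
```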

K-Means plus agglomerative clustering

Algorithm:
- Run K-Means with a large K.
- Count the number of observations in each cluster.
- Repeatedly merge the closest clusters according to the following metric.

Let A be a cluster with n_A members and centroid w_A, and B a cluster with n_B members and centroid w_B. The putative center of A \cup B is w_{AB} = (n_A w_A + n_B w_B)/(n_A + n_B). A quick estimate of the distortion increase caused by the merge is

    d(A, B) = \sum_{x \in A \cup B} \| x - w_{AB} \|^2 - \sum_{x \in A} \| x - w_A \|^2 - \sum_{x \in B} \| x - w_B \|^2
            = n_A \| w_A - w_{AB} \|^2 + n_B \| w_B - w_{AB} \|^2 .
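A minimal sketch of this merge metric; the function name is illustrative.

```python
import numpy as np

def merge_cost(n_a, w_a, n_b, w_b):
    """Distortion increase d(A, B) caused by merging clusters A and B."""
    w_ab = (n_a * w_a + n_b * w_b) / (n_a + n_b)   # putative center of A union B
    return n_a * np.sum((w_a - w_ab) ** 2) + n_b * np.sum((w_b - w_ab) ** 2)
```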

Dendrogram [figure].

Simple Gaussian mixture (1)

Clustering via density estimation: pick a parametric model P_θ(X) and maximize the likelihood.

The parametric model has K components. To generate an observation:
a) pick a component k with probabilities λ_1 ... λ_K;
b) generate x from component k with distribution N(µ_k, σ).

Notes: same standard deviation σ for all components (for now). That's why I write "Simple GMM".

Simple Gaussian mixture (2)

Parameters: θ = (λ_1, µ_1, ..., λ_K, µ_K).

Model:

    P_\theta(Y = y) = \lambda_y,
    P_\theta(X = x \mid Y = y) = \frac{1}{\sigma^d (2\pi)^{d/2}} \exp\!\left( -\frac{1}{2} \frac{\| x - \mu_y \|^2}{\sigma^2} \right).

Likelihood:

    \log L(\theta) = \sum_{i=1}^{n} \log P_\theta(X = x_i) = \sum_{i=1}^{n} \log \sum_{y=1}^{K} P_\theta(Y = y)\, P_\theta(X = x_i \mid Y = y) = \ldots

Maximize! This is non-convex. There are K! copies of each minimum (local or global). Conjugate gradients or Newton's method work.
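A minimal sketch of this log-likelihood for the simple (shared σ) mixture, using scipy.special.logsumexp for the log of the inner sum; the parameter layout is an illustrative choice.

```python
import numpy as np
from scipy.special import logsumexp

def simple_gmm_loglik(X, lam, mu, sigma):
    """log L(theta) for a mixture of K spherical Gaussians with common sigma."""
    n, d = X.shape
    # log lambda_y + log N(x_i | mu_y, sigma^2 I), shape (n, K)
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    log_joint = (np.log(lam)[None, :]
                 - 0.5 * d * np.log(2 * np.pi * sigma ** 2)
                 - 0.5 * d2 / sigma ** 2)
    return logsumexp(log_joint, axis=1).sum()
```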

Expectation-Maximization

Fortunately there is a simpler solution. We observe X, but we do not observe Y. Things would be simpler if we knew Y.

Decomposition: for a given X, guess a distribution Q(Y | X). Regardless of our guess,

    \log L(\theta) = L(Q, \theta) - M(Q, \theta) + D(Q, \theta)

where

    L(Q, \theta) = \sum_{i=1}^{n} \sum_{y=1}^{K} Q(y \mid x_i) \log P_\theta(x_i \mid y)        (Gaussian log-likelihood)

    M(Q, \theta) = \sum_{i=1}^{n} \sum_{y=1}^{K} Q(y \mid x_i) \log \frac{Q(y \mid x_i)}{P_\theta(y)}        (KL divergence between Q(Y|X) and P_\theta(Y))

    D(Q, \theta) = \sum_{i=1}^{n} \sum_{y=1}^{K} Q(y \mid x_i) \log \frac{Q(y \mid x_i)}{P_\theta(y \mid x_i)}        (KL divergence between Q(Y|X) and P_\theta(Y|X))
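To see why the decomposition holds, expand the three sums and use Bayes' rule together with \sum_y Q(y \mid x_i) = 1; this short derivation is not on the slide.

```latex
\begin{aligned}
L(Q,\theta) - M(Q,\theta) + D(Q,\theta)
&= \sum_{i=1}^{n} \sum_{y=1}^{K} Q(y \mid x_i)\,
   \log \frac{P_\theta(x_i \mid y)\, P_\theta(y)}{P_\theta(y \mid x_i)} \\
&= \sum_{i=1}^{n} \sum_{y=1}^{K} Q(y \mid x_i) \log P_\theta(x_i)
 = \sum_{i=1}^{n} \log P_\theta(x_i) = \log L(\theta).
\end{aligned}
```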

Expectation-Maximization

Remember Lloyd's algorithm for K-Means? EM alternates in the same way: the E-step drives D(Q, θ) to zero by setting Q(y | x_i) to P_θ(y | x_i), and the M-step maximizes L(Q, θ) - M(Q, θ) over θ.

E-step: soft assignments

    q_{ik} \propto \lambda_k\, e^{-\frac{1}{2} \frac{\| x_i - \mu_k \|^2}{\sigma^2}}   (normalized so that \sum_k q_{ik} = 1).

M-step: update parameters

    \mu_k \leftarrow \frac{\sum_i q_{ik}\, x_i}{\sum_i q_{ik}}, \qquad
    \lambda_k \leftarrow \frac{\sum_i q_{ik}}{\sum_i \sum_y q_{iy}}.
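A minimal NumPy sketch of these E and M steps for the shared-σ mixture; the initialization from random observations and the fixed iteration count are illustrative choices, and σ is kept fixed as on the slide.

```python
import numpy as np

def em_simple_gmm(X, K, sigma=1.0, n_iter=50, rng=None):
    """EM for a mixture of K spherical Gaussians with fixed common sigma."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    mu = X[rng.choice(n, size=K, replace=False)].astype(float)
    lam = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: soft assignments q_ik proportional to lambda_k * exp(-||x_i - mu_k||^2 / (2 sigma^2)).
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        log_q = np.log(lam)[None, :] - 0.5 * d2 / sigma ** 2
        log_q -= log_q.max(axis=1, keepdims=True)       # stabilize before exponentiating
        q = np.exp(log_q)
        q /= q.sum(axis=1, keepdims=True)
        # M-step: re-estimate means and mixing proportions.
        nk = q.sum(axis=0)
        mu = (q.T @ X) / nk[:, None]
        lam = nk / nk.sum()
    return lam, mu
```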

Simple Gaussian mixture (3)

Relation with K-Means:
- Like K-Means, but with soft assignments.
- Reduces to K-Means in the limit σ → 0.

In practice:
- Clearly slower than K-Means.
- More robust to local minima.
- Annealing σ helps.

Subtleties:
- Relation between σ and the number of clusters...
- Relation between EM and Newton's method.

Next lecture

- Expectation-Maximization in general.
- EM for general Gaussian Mixture Models.
- EM for all kinds of mixtures.
- EM for dealing with missing data.