Clustering (Bishop ch 9)

Clustering (Bishop ch 9) Reference: Data Mining by Margaret Dunham (a slide source) 1

Clustering Clustering is unsupervised learning; there are no class labels. Want to find groups of similar instances. Often use a distance measure (such as Euclidean distance) for dis-similarity. Can use cluster membership/distances as additional (created) features. 2

Clustering Examples Segment a customer database based on similar buying patterns. Group houses in a town into neighborhoods based on features (location, sq ft, stories, lot size). Identify new plant species. Identify similar Web usage patterns. 3

Clustering Problem Given data D={x_1, x_2, ..., x_n} of feature vectors and an integer value k, the Clustering Problem is to define a mapping where each x_i is assigned to one cluster K_j, 1<=j<=k. A cluster, K_j, contains precisely those vectors mapped to it. Unlike the classification problem, the clusters are not known a priori. 4

Impact of Outliers on Clustering What are the best two clusters? 5

Types of Clustering Hierarchical: creates a tree of clusterings. Agglomerative (bottom up: merge closest). Divisive (top down - less common). Partitional: one set of clusters created, usually with the # of clusters supplied by the user. Clusters can be: Overlapping (soft) / Non-overlapping (hard). 6

Closest Clusters? Single Link: smallest distance between points. Complete Link: largest distance between points. Average Link: average distance between points. Centroid: distance between centroids. 7
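
These four inter-cluster distances are easy to state in code. Below is a small illustrative NumPy sketch (the helper names and example clusters are my own choices, not from the slides), for clusters given as arrays of points.

```python
import numpy as np

def pairwise(a, b):
    """All pairwise Euclidean distances between points of cluster a and cluster b."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)

def single_link(a, b):    return pairwise(a, b).min()    # smallest pairwise distance
def complete_link(a, b):  return pairwise(a, b).max()    # largest pairwise distance
def average_link(a, b):   return pairwise(a, b).mean()   # average pairwise distance
def centroid_link(a, b):  return np.linalg.norm(a.mean(axis=0) - b.mean(axis=0))

a = np.array([[0.0, 0.0], [1.0, 0.0]])
b = np.array([[4.0, 0.0], [6.0, 0.0]])
print(single_link(a, b), complete_link(a, b), average_link(a, b), centroid_link(a, b))
# 3.0 6.0 4.5 4.5
```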

Levels of Clustering 8

Dendrogram Dendrogram: a tree data structure which illustrates hierarchical clustering techniques. Each level shows the clusters for that level. Leaf - individual clusters. Root - one cluster. A cluster at level i is the union of its children clusters at level i+1. 9

Partitional Clustering Nonhierarchical - creates one level of clustering. Since only one set of clusters is output, the user normally has to input the desired number of clusters, k. Sometimes try different k and use the best one. 10

Partitional Algorithms K-Means. Gaussian mixtures (EM). Many, many others. 11

K-means clustering 1. Pick k starting means, µ_1, µ_2, ..., µ_k. Can use: randomly picked examples, perturbations of the sample mean, or points equally spaced along the principal component. 2. Repeat until convergence: 1. Split the data into k sets, S_1, S_2, ..., S_k, where x ∈ S_i iff µ_i is the closest mean to x. 2. Update each µ_i to the mean of S_i. 12
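
To make the loop concrete, here is a minimal NumPy sketch of the procedure above (the function name, convergence test, and handling of empty clusters are my own choices, not from the slides); the usage line reproduces the 1-D example on the next slide.

```python
import numpy as np

def kmeans(X, k, init_means, max_iters=100):
    """Minimal k-means: alternate assignment and mean-update until the means stop changing."""
    means = np.asarray(init_means, dtype=float)
    for _ in range(max_iters):
        # Assignment step: each point goes to its closest mean.
        dists = np.abs(X[:, None] - means[None, :]) if X.ndim == 1 \
                else np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each mean becomes the average of its assigned points.
        new_means = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else means[j] for j in range(k)])
        if np.allclose(new_means, means):
            break
        means = new_means
    return means, labels

# Reproduces the 1-D example on the next slide:
X = np.array([2, 4, 10, 12, 3, 20, 30, 11, 25], dtype=float)
means, labels = kmeans(X, k=2, init_means=[3, 4])
print(means)   # converges to means 7 and 25
```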

[Slide content not captured in the transcription; source: Lecture Notes for E. Alpaydın 2010, Introduction to Machine Learning 2e, The MIT Press (V1.0)] 13

K-Means Example Given: {2,4,10,12,3,20,30,11,25}, k=2. Randomly assign means: m_1=3, m_2=4. K_1={2,3}, K_2={4,10,12,20,30,11,25}, m_1=2.5, m_2=16. K_1={2,3,4}, K_2={10,12,20,30,11,25}, m_1=3, m_2=18. K_1={2,3,4,10}, K_2={12,20,30,11,25}, m_1=4.75, m_2=19.6. K_1={2,3,4,10,11,12}, K_2={20,30,25}, m_1=7, m_2=25. Stop, as the clusters with these means stay the same. 14

[Figure: Original, k=2, k=3, k=10] 15

Tabular view of k-means
        µ_1   µ_2   µ_3   µ_4
x_1      1     0     0     0
x_2      0     0     1     0
x_3      1     0     0     0
16

Soft k-means clustering
        µ_1   µ_2   µ_3   µ_4
x_1     .6    .2    .1    .1
x_2     .2    .1    .5    .2
x_3     .4    .3    .1    .2
17

From Soft clustering to EM Use weighted means based on the soft-clustering weights. Soft cluster weights are probabilities: P(cluster | x). Uses Bayes rule: P(cluster | x) is proportional to P(x | cluster) P(cluster). For each x, the true cluster for x is a latent (unobserved) variable. 18
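
As a concrete illustration of the Bayes-rule step, here is a small sketch with made-up 1-D Gaussian clusters (all parameter values are hypothetical, chosen only for illustration, not from the slides).

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    """1-D Gaussian density."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Hypothetical mixture parameters.
priors = np.array([0.3, 0.2, 0.2, 0.3])   # P(cluster)
mus    = np.array([2.0, 6.0, 11.0, 25.0])
sigmas = np.array([1.5, 1.5, 2.0, 4.0])

x = 10.0
# Bayes rule: P(cluster | x) is proportional to P(x | cluster) * P(cluster).
unnormalized = gauss_pdf(x, mus, sigmas) * priors
responsibilities = unnormalized / unnormalized.sum()
print(responsibilities)   # soft cluster weights for x, summing to 1
```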

Soft clustering to EM 2 Assume parametric forms for P(cluster) (multinomial) and P(x | cluster) (Gaussian). Iteratively: 1. Smoothly estimate the cluster memberships (latent variables) based on the data and the old parameters. 2. Update the parameters to maximize the likelihood of the data assuming the new estimates are truth. This is the mixture-of-Gaussians EM algorithm; see http://citeseer.ist.psu.edu/bilmes98gentle.html 19

            1          2          3          4
P(c)       .3         .2         .2         .3
P(x|c)   µ_1, σ_1   µ_2, σ_2   µ_3, σ_3   µ_4, σ_4
Entries in the x rows are P(x|c)P(c), normalized:
x_1        .6         .2         .1         .1
x_2        .2         .1         .5         .2
x_3        .4         .3         .1         .2
20

Expectation-Maximization (EM) Log likelihood with a mixture model:
$$\mathcal{L}(\Phi \mid X) = \log \prod_t p(x^t \mid \Phi) = \sum_t \log \sum_{i=1}^{k} p(x^t \mid G_i)\, P(G_i)$$
Assume hidden variables z, which when known, make optimization much simpler. Complete likelihood, L_c(Φ | X, Z), in terms of x and z. Incomplete likelihood, L(Φ | X), in terms of x. (Lecture Notes for E. Alpaydın 2010, Introduction to Machine Learning 2e, The MIT Press, V1.0) 21

E- and M-steps Iterate the two steps: 1. E-step: Estimate z given X and the current Φ. 2. M-step: Find the new Φ given z, X, and the old Φ.
$$\text{E-step: } Q(\Phi \mid \Phi^l) = E\!\left[\mathcal{L}_C(\Phi \mid X, Z) \mid X, \Phi^l\right]$$
$$\text{M-step: } \Phi^{l+1} = \arg\max_{\Phi} Q(\Phi \mid \Phi^l)$$
An increase in Q increases the incomplete likelihood: $\mathcal{L}(\Phi^{l+1} \mid X) \ge \mathcal{L}(\Phi^{l} \mid X)$. (Lecture Notes for E. Alpaydın 2010, Introduction to Machine Learning 2e, The MIT Press, V1.0) 22

EM as likelihood ascent [Figure: the likelihood and the lower bound Q (using estimated memberships) plotted against Φ, with the current and new parameter values marked.] Φ: parameters for the mixture and all G_i. 23

EM in Gaussian Mixtures Hidden z_i^t = 1 if x^t belongs to G_i, 0 otherwise (the labels r_i^t of supervised learning); assume p(x | G_i) ~ N(µ_i, Σ_i).
E-step:
$$E[z_i^t \mid X, \Phi^l] = \frac{p(x^t \mid G_i, \Phi^l)\, P(G_i)}{\sum_j p(x^t \mid G_j, \Phi^l)\, P(G_j)} = P(G_i \mid x^t, \Phi^l) \equiv h_i^t$$
M-step:
$$P(G_i) = \frac{\sum_t h_i^t}{N}, \qquad m_i^{l+1} = \frac{\sum_t h_i^t x^t}{\sum_t h_i^t}, \qquad S_i^{l+1} = \frac{\sum_t h_i^t (x^t - m_i^{l+1})(x^t - m_i^{l+1})^T}{\sum_t h_i^t}$$
Q uses the estimated z's (the h's) in place of the unknown labels. (Lecture Notes for E. Alpaydın 2010, Introduction to Machine Learning 2e, The MIT Press, V1.0) 24
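
A minimal NumPy sketch of these E- and M-step updates for a 1-D Gaussian mixture, under my own assumptions about initialization and stopping (not from the slides); the variance floor anticipates the degeneracy fix discussed on slide 26.

```python
import numpy as np

def gmm_em(X, k, n_iters=50, seed=0):
    """EM for a 1-D mixture of Gaussians, following the slide's updates."""
    rng = np.random.default_rng(seed)
    N = len(X)
    priors = np.full(k, 1.0 / k)                   # P(G_i)
    means = rng.choice(X, size=k, replace=False)   # m_i
    variances = np.full(k, X.var())                # S_i (scalar variances)
    for _ in range(n_iters):
        # E-step: h_i^t = P(G_i | x^t) via Bayes rule.
        pdf = np.exp(-0.5 * (X[:, None] - means) ** 2 / variances) \
              / np.sqrt(2 * np.pi * variances)
        h = pdf * priors
        h /= h.sum(axis=1, keepdims=True)
        # M-step: re-estimate priors, means, and variances from the soft counts.
        Nk = h.sum(axis=0)
        priors = Nk / N
        means = (h * X[:, None]).sum(axis=0) / Nk
        variances = (h * (X[:, None] - means) ** 2).sum(axis=0) / Nk
        variances = np.maximum(variances, 1e-6)    # guard against collapse
    return priors, means, variances

X = np.array([2, 4, 10, 12, 3, 20, 30, 11, 25], dtype=float)
print(gmm_em(X, k=2))
```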

[Figure: P(G_1 | x) = h_1 = 0.5] 25

Problems with EM Local minima: run several times and take the best result; use good initialization (perhaps k-means). Degenerate Gaussians: as σ goes to zero, the likelihood goes to infinity; fix a lower bound on σ. Lots of parameters to learn: use spherical Gaussians or shared covariance matrices (or even fixed distributions). 26

EM summary Iterative method for maximizing likelihood. General method - not just for Gaussian mixtures, but also HMMs, Bayes nets, etc. Generally works well, but can have local minima and degenerate situations. Gets both a clustering and a distribution (mixture of Gaussians) - the distributions can be used for Bayesian learning (e.g. learn P(x|y) using a Gaussian mixture model). 27

Expectation-Maximization (EM) Log likelihood with a mixture model:
$$\mathcal{L}(\Phi \mid X) = \log \prod_t p(x^t \mid \Phi) = \sum_t \log \sum_{i=1}^{k} p(x^t \mid G_i)\, P(G_i)$$
Each G_i is a generative model - the log of a sum is tough to optimize. Assume hidden variables z, which when known, make optimization much simpler (they tell which G_i generated each x). Complete likelihood, L_c(Φ | X, Z), in terms of x and z. Incomplete likelihood, L(Φ | X), in terms of x. 29

E- and M-steps Iterate the following two steps: E-step: Estimate the distribution for z given X and the current Φ. M-step: Find the new Φ given z, X, and the old Φ.
$$\text{E-step: } Q(\Phi \mid \Phi^l) = E\!\left[\mathcal{L}_C(\Phi \mid X, Z) \mid X, \Phi^l\right]$$
$$\text{M-step: } \Phi^{l+1} = \arg\max_{\Phi} Q(\Phi \mid \Phi^l)$$
An increase in Q increases the incomplete likelihood: $\mathcal{L}(\Phi^{l+1} \mid X) \ge \mathcal{L}(\Phi^{l} \mid X)$. 30

EM in Gaussian Mixtures Hidden z_i^t = 1 if x^t belongs to G_i, 0 otherwise; assume p(x | G_i) ~ N(µ_i, Σ_i).
E-step:
$$E[z_i^t \mid X, \Phi^l] = \frac{p(x^t \mid G_i, \Phi^l)\, P(G_i)}{\sum_j p(x^t \mid G_j, \Phi^l)\, P(G_j)} = P(G_i \mid x^t, \Phi^l) \equiv h_i^t$$
M-step:
$$P(G_i) = \frac{\sum_t h_i^t}{N}, \qquad m_i^{l+1} = \frac{\sum_t h_i^t x^t}{\sum_t h_i^t}, \qquad S_i^{l+1} = \frac{\sum_t h_i^t (x^t - m_i^{l+1})(x^t - m_i^{l+1})^T}{\sum_t h_i^t}$$
Use the estimated labels in place of the unknown labels. 31

EM in Gaussian Mixtures Hidden z_i^t = 1 if x^t belongs to G_i, 0 otherwise; assume p(x | G_i) ~ N(µ_i, Σ_i).
$$\text{E-step: } Q(\Phi \mid \Phi^l) = E\!\left[\mathcal{L}_C(\Phi \mid X, Z) \mid X, \Phi^l\right]$$
$$\text{M-step: } \Phi^{l+1} = \arg\max_{\Phi} Q(\Phi \mid \Phi^l)$$
where
$$\mathcal{L}_c(\Phi \mid X, Z) = \sum_t \log p(x^t, z^t \mid \Phi), \qquad Q(\Phi^{\text{new}} \mid \Phi^{\text{old}}) = \sum_Z P(Z \mid X, \Phi^{\text{old}})\, \log p(X, Z \mid \Phi^{\text{new}})$$
32

After Clustering Dimensionality reduction methods find correlations between features and group features. Clustering methods find similarities between instances and group instances. Allows knowledge extraction through the number of clusters, prior probabilities, and cluster parameters, i.e., center, range of features. 33

Clustering as Preprocessing Estimated group labels h_j (soft) or b_j (hard) may be seen as the dimensions of a new k-dimensional space, where we can then learn our discriminant or regressor. Local representation (only one b_j is 1, all others are 0; only a few h_j are nonzero) vs. distributed representation (after PCA; all z_j are nonzero). 34
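
A short sketch of this idea under my own assumptions: use scikit-learn's KMeans to produce hard memberships b_j and then fit a discriminant in that k-dimensional space. The toy data and model choices are illustrative only, not from the slides.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                 # toy data, 5 original features
y = (X[:, 0] + X[:, 1] > 0).astype(int)       # toy labels

# Hard memberships b_j: one-hot encoding of the assigned cluster (local representation).
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
B = np.eye(4)[km.labels_]                     # shape (200, 4): one 1 per row

# Learn the discriminant in the new k-dimensional space.
clf = LogisticRegression().fit(B, y)
print(clf.score(B, y))
```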

Mixture of Mixtures In classification, the input comes from a mixture of classes (supervised). If each class is also a mixture, e.g., of Gaussians (unsupervised), we have a mixture of mixtures:
$$p(x \mid C_i) = \sum_{j=1}^{k_i} p(x \mid G_{ij})\, P(G_{ij}), \qquad p(x) = \sum_{i=1}^{K} p(x \mid C_i)\, P(C_i)$$
35

Clustering vs. Classification Less prior knowledge. Number of clusters (may be assumed). Meaning of clusters not assumed. Unsupervised learning - no labels. 36

Cluster Parameters t_mi is the ith feature vector in cluster m. 37

Clustering Issues Outlier handling. Dynamic data. Interpreting results. Evaluating results. Number of clusters. Data to be used. Scalability. 38

Herarchcal Cluserng Clusers are creaed n levels acually creang ses of clusers a each level. Agglomerave Inally each em n s own cluser Ieravely clusers are merged ogeher Boom Up Dvsve Inally all ems n one cluser Large clusers are successvely dvded Top Down 39