Grouping 2: Spectral and Agglomerative Clustering. CS 510, Lecture #16, April 2nd, 2014

Grouping (review)
Goal:
- Detect local image features (SIFT)
- Describe image patches around features (SIFT, SURF, HoG, LBP, ...)
- Group features to form codes
- Lots of features (from all training data)
- High dimensional feature vectors (see above)
- Number K of codes (clusters) is known

Grouping (review, cont.)
Two generative models of clustering:
- K-Means
  - Assumes all clusters have the same variance in every dimension
  - Assumes all clusters have the same variance as each other
- Expectation Maximization (EM)
  - Fits an arbitrary Gaussian to every cluster
  - More general; can cluster more complex cases
  - Risk of over-fitting
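As a concrete point of reference (not from the slides), both review models are available off the shelf in scikit-learn; a minimal sketch, assuming 2-D toy data and K = 3:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))  # toy data; stand-in for real feature vectors

# K-Means: implicitly assumes spherical, equal-variance clusters
km_labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)

# EM: fits a full-covariance Gaussian per cluster (more general, can over-fit)
gm_labels = GaussianMixture(n_components=3, covariance_type="full").fit_predict(X)
```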

Alternative Approaches
If we don't need to map to probabilities:
- We don't need to model underlying distributions
- We can optimize other functions
Goal:
- Divide N samples into K clusters (groups)
- Maximizing the similarity of samples within groups
- Minimizing the similarity of samples between groups
Problem:
- Exponentially many possible groupings
- Direct solutions are NP-hard
- Therefore, we look for heuristic solutions

Preview
- Spectral Clustering
  - Divide data using K processes
  - Maximizing similarity within groups
  - Subject to an artificial orthogonality constraint
- Agglomerative clustering with Ward's Linkage
  - Recursively merge samples
  - Minimizing variance within groups
  - Greedy heuristic

Spectral Clustering
1. Define an affinity (similarity) matrix A
   - $a_{ij}$ is large (~1) if samples $a_i$ and $a_j$ are similar
   - $a_{ij}$ is small (but >= 0) if samples are dissimilar
   - $a_{ij} = \exp(-\|a_i - a_j\|_2)$ is common
   - Similarities below a threshold are often set to 0
   - The affinity matrix must be symmetric
2. Define a diagonal degree matrix D: $d_{ii} = \sum_j a_{ij}$
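A minimal numpy sketch of steps 1 and 2, assuming the samples sit in the rows of a matrix X and using a hypothetical cutoff parameter eps for thresholding small similarities:

```python
import numpy as np

def affinity_and_degree(X, eps=1e-3):
    """Build the affinity matrix A and the diagonal degree matrix D.

    X   : (n, d) array, one sample per row
    eps : assumed cutoff; similarities below it are set to 0
    """
    # Pairwise Euclidean distances ||x_i - x_j||_2
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

    # a_ij = exp(-||x_i - x_j||_2); symmetric by construction
    A = np.exp(-dists)
    A[A < eps] = 0.0

    # d_ii = sum_j a_ij
    D = np.diag(A.sum(axis=1))
    return A, D
```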

Spectral Clustering (II)
3. Define the Laplacian matrix L: $L = D - A$
What does this matrix look like?
L has an important property:
$$f^T L f = \frac{1}{2} \sum_{i,j=1}^{n} a_{ij} (f_i - f_j)^2$$
f is any vector, but we will interpret it as a vector of cluster labels.

Proof
$$f^T L f = f^T D f - f^T A f = \sum_{i=1}^{n} d_i f_i^2 - \sum_{i,j=1}^{n} f_i f_j a_{ij}$$
$$= \frac{1}{2} \left( \sum_{i=1}^{n} d_i f_i^2 - 2 \sum_{i,j=1}^{n} f_i f_j a_{ij} + \sum_{j=1}^{n} d_j f_j^2 \right) = \frac{1}{2} \sum_{i,j=1}^{n} a_{ij} (f_i - f_j)^2$$
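A quick numeric sanity check of this identity (a sketch, reusing the affinity_and_degree helper assumed in the earlier snippet):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))
A, D = affinity_and_degree(X)
L = D - A                      # the (unnormalized) graph Laplacian

f = rng.normal(size=10)        # any vector of "labels"
lhs = f @ L @ f
rhs = 0.5 * np.sum(A * (f[:, None] - f[None, :]) ** 2)
assert np.isclose(lhs, rhs)    # f^T L f == (1/2) sum_ij a_ij (f_i - f_j)^2
```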

Spectral Clustering (III)
4. Now take the eigenvectors of L
   - Yes, compute the eigendecomposition $L = R \Lambda R^T$ (L is symmetric)
Remember the definition of an eigenvector: $L f_i = \lambda_i f_i$, so for a unit-norm eigenvector $f_i^T L f_i = \lambda_i$.
So if an eigenvector has eigenvalue 0:
$$0 = f^T L f = \frac{1}{2} \sum_{i,j=1}^{n} a_{ij} (f_i - f_j)^2$$
In other words, all pairs of samples either:
- Have the same label, or
- Have 0 similarity

Spectral Clustering (IV)
The number of 0 eigenvalues is the number of disconnected groups in L
- Every L has at least one 0 eigenvalue
- The corresponding eigenvector is $(1, 1, \ldots, 1)$
More generally, eigenvectors with small eigenvalues minimize
$$\frac{1}{2} \sum_{i,j=1}^{n} a_{ij} (f_i - f_j)^2$$
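The disconnected-groups property is easy to see on a toy block-diagonal affinity matrix; a sketch, assuming exact zeros between the two blocks:

```python
import numpy as np

# Two disconnected groups: block-diagonal affinity, zeros between blocks
A = np.zeros((6, 6))
A[:3, :3] = 0.9
A[3:, 3:] = 0.9
np.fill_diagonal(A, 1.0)

L = np.diag(A.sum(axis=1)) - A
evals, evecs = np.linalg.eigh(L)           # eigh: L is symmetric

print(np.sum(np.isclose(evals, 0.0)))      # 2 zero eigenvalues -> 2 groups
# The zero-eigenvalue eigenvectors are constant within each group
print(np.round(evecs[:, :2], 2))
```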

Spectral Clustering (V)
In other words, if samples are projected onto the eigenvectors of L:
- The most similar samples will cluster
- Selecting K eigenvectors generates K orthogonal processes
5. Project the data onto the K eigenvectors with the smallest eigenvalues
6. Cluster the projected samples using K-Means
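Putting steps 1 through 6 together, a minimal end-to-end sketch (assuming scikit-learn for the K-Means step and the affinity_and_degree helper sketched earlier):

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_cluster(X, K, eps=1e-3):
    """Spectral clustering as outlined above (a sketch, not a robust implementation)."""
    A, D = affinity_and_degree(X, eps)
    L = D - A

    # Eigendecomposition of the symmetric Laplacian; eigh sorts eigenvalues ascending
    evals, evecs = np.linalg.eigh(L)

    # 5. Project onto the K eigenvectors with the smallest eigenvalues
    F = evecs[:, :K]                        # (n, K) spectral embedding

    # 6. Clean up the real-valued "labels" with K-Means
    return KMeans(n_clusters=K, n_init=10).fit_predict(F)
```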

Spectral Clustering (summary)
The eigenvectors of L minimize
$$\frac{1}{2} \sum_{i,j=1}^{n} a_{ij} (f_i - f_j)^2$$
- Since the f's are labels, this tries to give similar samples similar labels
- Dissimilar samples can have different labels ($a_{ij}$ is 0 or small)
- Works best when A contains many 0s
- The f's are not integers, however
- So K-Means cleans it up

Spectral Clustering (last SC slide, I promise)
- An efficient way to exploit gaps without assuming distributions of samples
- Weakness: doesn't find very small groups
  - Groups with very few samples have too small an effect on the eigendecomposition

Agglomerative Clustering
- Initialize every sample to be its own cluster
  - Groups of size 1
- Iteratively find the most similar pair of groups & merge them
  - Until the number of groups equals K
Note that this is a simple greedy search using whatever function (linkage) measures similarity.
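A straightforward, unoptimized sketch of this greedy loop, parameterized by an arbitrary linkage function (here, a lower score means a better merge, matching Ward's gain on the next slide):

```python
def agglomerate(samples, K, linkage):
    """Greedy agglomerative clustering (a sketch; O(n^3)-ish, not production code).

    samples : list of points
    K       : desired number of groups
    linkage : function(groupA, groupB) -> merge cost; the smallest pair is merged
    """
    groups = [[s] for s in samples]          # every sample starts as its own cluster
    while len(groups) > K:
        # Find the pair of groups with the lowest merge cost
        i, j = min(
            ((i, j) for i in range(len(groups)) for j in range(i + 1, len(groups))),
            key=lambda ij: linkage(groups[ij[0]], groups[ij[1]]),
        )
        groups[i] += groups[j]               # merge group j into group i
        del groups[j]
    return groups
```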

Ward's Linkage
Minimize the total variance:
- The sum of the squared distances from every point to its cluster center
- Initial total variance is zero (singleton groups)
On every step, merge the two groups with the smallest gain in variance:
$$\mathrm{Gain}(A, B) = \mathrm{Var}(A \cup B) - \mathrm{Var}(A) - \mathrm{Var}(B)$$
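Ward's gain drops straight into the loop above; a numpy sketch, where Var(G) is the sum of squared distances from each point in G to the group centroid:

```python
import numpy as np

def total_variance(group):
    """Sum of squared distances from each point to the group centroid."""
    G = np.asarray(group, dtype=float)
    return float(np.sum((G - G.mean(axis=0)) ** 2))

def wards_gain(A, B):
    """Gain(A, B) = Var(A u B) - Var(A) - Var(B); zero for two singletons merged at a point."""
    return total_variance(A + B) - total_variance(A) - total_variance(B)

# Usage with the agglomerate sketch above: groups = agglomerate(list(X), K, wards_gain)
```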

Agglomerative + Ward's
- Minimizes intra-class variance
- Assumes Euclidean geometry
- Heuristic does not find the global optimum