Vector Quantization: a Limiting Case of EM

1. Introduction & definitions

Assume that you are given a data set $X = \{x_j\}$, $j \in \{1, 2, \ldots, n\}$, of $d$-dimensional vectors. The vector quantization (VQ) problem requires that we find a set of prototype vectors $Z = \{z_i\}$, $i \in \{1, 2, \ldots, L\}$, $L \ll n$, such that the total distortion $D$,

$$D = \sum_{j=1}^{n} \min_i \,\text{dist}(x_j, z_i) \tag{1}$$

is minimized. In equation (1), $\text{dist}(x_j, z_i)$ is a distance metric given by either,

$$\text{dist}(x, z) = \|x - z\|^2 \quad \text{(Euclidean distance)}, \tag{2}$$

or, more generally,

$$\text{dist}(x, z) = (x - z)^T Q (x - z) \tag{3}$$

where $Q$ is a positive-definite, symmetric matrix, $Q > 0$. Weighting the distance along each dimension through the $Q$ matrix can normalize the distance measure in equation (3) with respect to different scaling along different dimensions of the $\{x_j\}$ vectors. The vectors $Z$ are known as the VQ codebook. Two applications of vector quantization are (1) redundant data compression, and (2) approximating continuous probability distributions with approximate discrete ones (i.e. histograms), where each $x_j$ is replaced with the label $i$ such that,

$$\text{dist}(x_j, z_i) \le \text{dist}(x_j, z_l), \quad \forall l. \tag{4}$$

We will see later that this is especially useful in hidden Markov modeling.

2. The k-means algorithm

A. Algorithm definition

The k-means algorithm is an algorithm for generating $Z$, the VQ codebook of prototype vectors. It is guaranteed to converge to a local minimum of $D$. The algorithm proceeds as follows:

1. Initialization: Choose some initial setting for the $L$ codes $\{z_i\}$ in the VQ codebook. One way to do this is to initialize the $\{z_i\}$ to some random subset of $L$ vectors in $X$.

2. Classification: Classify each $x_j$ into cluster (or class) $\omega_i$ such that,

$$\text{dist}(x_j, z_i) \le \text{dist}(x_j, z_l), \quad \forall l. \tag{5}$$

3. Codebook update: Update the code for every cluster $\omega_i$ by computing its centroid,

$$z_i = \frac{1}{n_i} \sum_{x_j \in \omega_i} x_j \tag{6}$$

where $n_i$ is the number of vectors $x_j$ in cluster $\omega_i$.

4. Termination: Stop when the distortion $D$ has decreased below some threshold level, or when the algorithm has converged to a constant level of distortion; otherwise, loop back to step 2.
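Steps 1-4 translate directly into code. Below is a minimal NumPy sketch of the algorithm as listed above; the function and variable names (`kmeans`, `Z`, `labels`) are our own, and the squared-Euclidean distance of equation (2) is assumed.

```python
import numpy as np

def kmeans(X, L, tol=1e-6, max_iter=100, seed=None):
    """Return codebook Z (L x d), labels and total distortion D for data X (n x d)."""
    rng = np.random.default_rng(seed)
    # 1. Initialization: a random subset of L vectors in X.
    Z = X[rng.choice(len(X), size=L, replace=False)].astype(float)
    prev_D = np.inf
    for _ in range(max_iter):
        # 2. Classification (eq. 5), with squared Euclidean distance (eq. 2).
        d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)   # (n, L)
        labels = d2.argmin(axis=1)
        D = d2.min(axis=1).sum()                                  # eq. (1)
        # 3. Codebook update: move each code to its cluster centroid (eq. 6).
        for i in range(L):
            members = X[labels == i]
            if len(members) > 0:       # leave an empty cluster's code untouched
                Z[i] = members.mean(axis=0)
        # 4. Termination: distortion no longer decreasing (to within tol).
        if prev_D - D < tol:
            break
        prev_D = D
    return Z, labels, D

# Usage on uniform 2-D data: two codes in the unit square.
X = np.random.default_rng(0).random((2000, 2))
Z, labels, D = kmeans(X, L=2, seed=1)
print(np.round(Z, 2))
```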

B. Example #1

Here, we investigate the convergence properties of the k-means algorithm with a simple example. Let $X$ be a set of $n$ 2-dimensional vectors $\{x_j\}$, $j \in \{1, 2, \ldots, n\}$, distributed uniformly in the unit square,

$$0 \le x_1, x_2 \le 1 \tag{7}$$

as shown in Figure 1 below.

[Figure 1: the data set $X$, distributed uniformly over the unit square.]

Assuming infinite data, the distortion for two codes $\{z_1, z_2\}$ is given by,

$$D(z_1, z_2) = \int_{A_1} \text{dist}(x, z_1)\, dA + \int_{A_2} \text{dist}(x, z_2)\, dA \tag{8}$$

where $A_i$ denotes the area in the unit square that is part of cluster $\omega_i$, $i \in \{1, 2\}$. Therefore, the globally optimal solution $\{z_1^*, z_2^*\}$, in other words, the minimum-distortion solution for two codes, is given by,

$$\{z_1^*, z_2^*\} = \arg\min_{\{z_1, z_2\}} [D(z_1, z_2)] \tag{9}$$

Denoting,

$$z_1 = \{z_{11}, z_{12}\} \quad \text{and} \quad z_2 = \{z_{21}, z_{22}\}, \tag{10}$$

it appears that $D(z_1, z_2)$ must be optimized over four independent scalars: $\{z_{11}, z_{12}, z_{21}, z_{22}\}$. Since the data is distributed symmetrically about $(1/2, 1/2)$, the optimal prototype vectors $\{z_1^*, z_2^*\}$ are, however, constrained by,

$$z_{21} = 1 - z_{11} \tag{11}$$

$$z_{22} = 1 - z_{12} \tag{12}$$

Therefore, we can explicitly plot $D(z_1, \bar{z}_1)$, where $\bar{z}_1 = \{1 - z_{11}, 1 - z_{12}\}$, as a function of $\{z_{11}, z_{12}\}$, as shown in Figure 2 below. In Figure 2, red shades indicate the smallest distortions, and we see that there are four globally optimal solutions $\{z_1^*, z_2^*\}$:

$$\{z_1^*, z_2^*\} = \{(1/2, 1/4), (1/2, 3/4)\} = \{(1/2, 3/4), (1/2, 1/4)\} \tag{13}$$

$$\{z_1^*, z_2^*\} = \{(1/4, 1/2), (3/4, 1/2)\} = \{(3/4, 1/2), (1/4, 1/2)\} \tag{14}$$
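As a quick sanity check on equations (8) and (13)-(14), the snippet below (our own illustration, not from the original notes) estimates the two-code distortion by Monte Carlo and confirms that the mid-line solutions beat an alternative diagonal placement of the codes:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100_000, 2))          # stand-in for the 'infinite data' case

def mean_distortion(X, Z):
    """Monte Carlo estimate of eq. (8) per unit area, squared distance."""
    d2 = ((X[:, None, :] - np.asarray(Z)[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).mean()

print(mean_distortion(X, [(0.5, 0.25), (0.5, 0.75)]))    # optimal: ~0.104 (= 5/48)
print(mean_distortion(X, [(0.25, 0.25), (0.75, 0.75)]))  # diagonal: ~0.125, worse
```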

[Figure 2: distortion $D$ as a function of $\{z_{11}, z_{12}\}$; red shades indicate the smallest distortions.]

Note that the four solutions are the same, except for a switch in the two axes, as well as a switch in the prototype vector labels. Now that we know what the theoretical minimum-distortion two-code solutions are, we conduct the following experiment. We run the k-means algorithm for initial random prototypes in the interval,

$$(0, 0) \le z_i \le (1, 1), \quad i \in \{1, 2\} \tag{15}$$

and observe the values of $\{z_1, z_2\}$ to which the algorithm converges. Figure 3 below plots the results of the trials. Note that all trials converge to approximately the optimal solutions in (13) and (14). The decision boundary between class $\omega_1$ and $\omega_2$ for each solution pair $\{z_1, z_2\}$ is given by either,

$$x_1 = 0.5 \quad \text{or} \quad x_2 = 0.5. \tag{16}$$

[Figure 3: converged code locations for the random-initialization trials.]

C. Example #2

Here, we investigate the convergence properties of the k-means algorithm for the uniform distribution $X$ of points in the annular region shown in Figure 4 below, where the inside and outside radii are given by $r_1 = 0.12$ and $r_2 = 0.35$, respectively. Since the distribution is radially symmetric about the point $(1/2, 1/2)$, the locus of globally optimal (minimum-distortion) 2-code solutions is necessarily described by the circle,

$$(x_1 - 1/2)^2 + (x_2 - 1/2)^2 = r^{*2} \tag{17}$$

Once again assuming infinite data, we can compute the globally optimal value of $r^*$ by recognizing that the two classes $\omega_1$ and $\omega_2$ can be described by the solid and dashed lines indicated in Figure 4.

[Figure 4: the annular data region, with one possible two-class partition indicated by solid and dashed lines.]

Note that the delineated regions are only one possible description of $\omega_1$ and $\omega_2$; the regions can, of course, be rotated by an angle $\theta$, $0 \le \theta \le 2\pi$, without loss of optimality. To compute $r^*$ we now simply have to compute the centroid of the solid-line region, let's call this region $A$ in the above figure, so that,

$$r^* = \frac{\int_A x_2 \, dA}{\int_A dA} - \frac{1}{2} \tag{18}$$

$$r^* = \frac{1789}{3525\pi} \approx 0.1615 \tag{19}$$

Thus, the set of globally optimal solutions for the codes $\{z_1, z_2\}$ is given by,

$$\{z_1^*, z_2^*\} = \{(1/2 + r^*\cos\theta,\ 1/2 + r^*\sin\theta),\ (1/2 + r^*\cos[\theta + \pi],\ 1/2 + r^*\sin[\theta + \pi])\}, \quad \forall\theta. \tag{20}$$

We now conduct the following experiment. We run five trials of the k-means algorithm with initial random codes in the interval,

$$(0, 0) \le z_i \le (1, 1), \quad i \in \{1, 2\} \tag{21}$$

and observe the values of $\{z_1, z_2\}$ to which the algorithm converges. The figure below plots the results of the five trials. Note that all five trials converge to approximately the optimal solution locus in (20).

[Figure 5: converged code locations for the five trials, relative to the optimal circle of equation (20).]
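Equation (19) can be checked numerically. For this geometry, equation (18) reduces to the standard closed form for the centroid of a half-annulus; the following snippet (our own verification, not part of the original notes) evaluates it and compares against the exact value:

```python
import numpy as np

r1, r2 = 0.12, 0.35
# Centroid distance of a half-annulus from its center (standard result):
#   r* = (4 / (3*pi)) * (r2^3 - r1^3) / (r2^2 - r1^2)
r_star = 4 * (r2**3 - r1**3) / (3 * np.pi * (r2**2 - r1**2))
print(r_star)                   # 0.16154...
print(1789 / (3525 * np.pi))    # identical: the exact value in eq. (19)
```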

D. Convergence

In the previous two examples, we showed that the k-means algorithm converges to good near-optimal solutions for some simple, idealized cases. It turns out that this convergence is, in fact, guaranteed, since the k-means algorithm is simply a limiting case of the EM algorithm for estimating the parameters of a mixture of Gaussians. Recall that for the EM algorithm, the reestimation of the means $\mu_i$ (identical to the $z_i$ here) is given by,

$$\mu_i = \frac{\sum_{j=1}^{n} P(\omega_i | x_j)\, x_j}{\sum_{j=1}^{n} P(\omega_i | x_j)} \tag{22}$$

Assuming equal priors $P(\omega_i)$ and equal variances $\sigma_i^2 = \sigma^2$, it can easily be shown that,

$$\lim_{\sigma^2 \to 0} \mu_i = \lim_{\sigma^2 \to 0} \frac{\sum_{j=1}^{n} P(\omega_i | x_j)\, x_j}{\sum_{j=1}^{n} P(\omega_i | x_j)} = \frac{1}{n_i} \sum_{x_j \in \omega_i} x_j = z_i \tag{23}$$

Intuitively, as $\sigma^2 \to 0$,

$$p(x_j | \mu_i) \gg p(x_j | \mu_l), \quad i \ne l, \tag{24}$$

where,

$$D(x_j, \mu_i) < D(x_j, \mu_l) \tag{25}$$

so that the likelihoods $p(x_j | \mu_l)$, $i \ne l$, become insignificantly small compared to $p(x_j | \mu_i)$. Figure 6, for example, illustrates the convergence trajectories of the VQ and EM algorithms for one of the trials in example #1. Note that the VQ and EM trajectories appear very similar for $\sigma = 0.01$. Since the k-means VQ algorithm is a limiting case of the EM algorithm, its convergence is also guaranteed. And, because the VQ reestimation equations are much faster to compute than the EM reestimation equations, the VQ algorithm is sometimes preferred in practice.

[Figure 6: Example #1 VQ convergence (left); example #1 EM convergence for $\sigma = 0.1$ and $\sigma = 0.01$ (right).]

One potential problem in VQ algorithms is that during convergence, one or more clusters (or classes) $\omega_i$ might become empty. A typical solution to this problem splits the cluster which currently exhibits the largest distortion in two, and reassigns the empty class to part of the large-distortion cluster.
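The limiting behavior in equations (22)-(23) is easy to observe numerically. The toy example below is our own construction (1-D data, two fixed means, equal priors and variances); it shows the EM responsibilities $P(\omega_i | x_j)$ hardening into the 0/1 assignments of k-means as $\sigma$ shrinks:

```python
import numpy as np

x = np.array([0.1, 0.2, 0.8, 0.9])     # toy data points
mu = np.array([0.0, 1.0])              # two means (the z_i)
for sigma in (0.5, 0.1, 0.01):
    # Posteriors P(omega_i | x_j) for equal priors and equal variances.
    log_p = -0.5 * ((x[:, None] - mu[None, :]) / sigma) ** 2
    resp = np.exp(log_p - log_p.max(axis=1, keepdims=True))
    resp /= resp.sum(axis=1, keepdims=True)
    # EM mean reestimation, eq. (22).
    mu_new = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
    print(f"sigma={sigma}:\n{np.round(resp, 3)}  mu_new={np.round(mu_new, 3)}")
# As sigma -> 0 the rows of resp become one-hot, and eq. (22) reduces to the
# hard k-means centroid update of eq. (6), which is the content of eq. (23).
```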

3. The LBG VQ algorithm

A. Algorithm description

The well-known LBG vector quantization (VQ) algorithm was proposed by Linde, Buzo and Gray [1] in 1980. It addresses the problem of VQ codebook initialization by iteratively generating codebooks $\{z_i\}$, $i \in \{1, \ldots, 2^m\}$, $m \in \{0, 1, 2, \ldots\}$, of increasing size. The algorithm proceeds as outlined below. Note that the inner loop in the LBG VQ algorithm is equivalent to the k-means algorithm for the current value of $L$; a code sketch of the full procedure follows the setup of Example #1 below.

1. Initialization: Set $L = 1$, where $L$ is the number of VQ codes in $Z$, and let $z_1$ be the centroid (e.g. mean) of the data set $X$.

2. Splitting: Split each VQ code $z_i$ into two codes $\{z_i, z_{i+L}\}$,

$$z_{i+L} = z_i - \Delta \quad \text{and} \quad z_i = z_i + \Delta, \tag{26}$$

where,

$$\Delta = \varepsilon \{b_1, b_2, \ldots, b_d\}, \tag{27}$$

and $\varepsilon$ is some small number, typically 0.01. The $b_k$ can be set to all 1s or to a random value of $\pm 1$. Since the number of VQ codes in $Z$ has been doubled, let,

$$L = 2L. \tag{28}$$

Inner loop:

1. Classification: Classify each $x_j$ into cluster (or class) $\omega_i$ such that,

$$\text{dist}(x_j, z_i) \le \text{dist}(x_j, z_l), \quad \forall l. \tag{29}$$

2. Codebook update: Update the code for every cluster $\omega_i$ by computing its centroid,

$$z_i = \frac{1}{n_i} \sum_{x_j \in \omega_i} x_j \tag{30}$$

where $n_i$ is the number of vectors $x_j$ in cluster $\omega_i$.

3. Termination #1: Stop the inner loop when the distortion $D$ has decreased below some threshold level, or when the algorithm has converged to a constant level of distortion.

3. Termination #2: Stop when $L$ is the desired VQ codebook size; otherwise, return to the splitting step.

There are two main advantages of the LBG VQ algorithm over the standard k-means algorithm. First, the algorithm is self-starting, in the sense that problem-specific initialization is not required. Second, the LBG VQ algorithm automatically generates codebooks of size $2^m$, $m \in \{0, 1, 2, \ldots\}$. This can be useful when we do not know a priori how large the VQ codebook needs to be for a specific application with a required maximum level of distortion. Presently, the LBG VQ algorithm is probably the most frequently used VQ algorithm across a number of different applications.

B. Example #1

Here, we illustrate the LBG algorithm with a simple example. Let $X$ be a set of $n$ 2-dimensional vectors $\{x_j\}$, $j \in \{1, 2, \ldots, n\}$, distributed uniformly in the unit square. Figure 7 illustrates the LBG VQ algorithm for the deterministic perturbation vector $\Delta = \{0.01, 0.01\}$. The codes for each $L$ from 1 to $2^5 = 32$ are those at the end of the inner loop of the LBG algorithm, and the lines in each plot delineate the regions of the 2-dimensional space that are part of cluster $\omega_i$.
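Here is the promised minimal sketch of the LBG procedure of section A above, with the k-means inner loop factored out. This is our own illustrative code, not from [1]; the random $b_k$ choice, $\varepsilon$, and the tolerance-based stopping rule are the assumptions noted in the listing.

```python
import numpy as np

def kmeans_from(X, Z, tol=1e-6, max_iter=100):
    """Inner loop: plain k-means (eqs. 29-30) started from a given codebook Z."""
    Z = Z.copy()
    prev_D = np.inf
    for _ in range(max_iter):
        d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        D = d2.min(axis=1).sum()
        for i in range(len(Z)):
            members = X[labels == i]
            if len(members) > 0:
                Z[i] = members.mean(axis=0)
        if prev_D - D < tol:               # termination #1
            break
        prev_D = D
    return Z, D

def lbg(X, target_L, eps=0.01, seed=None):
    """Outer loop: grow the codebook by splitting, L = 1, 2, 4, ..., target_L."""
    rng = np.random.default_rng(seed)
    Z, D = kmeans_from(X, X.mean(axis=0, keepdims=True))   # step 1: L = 1
    while len(Z) < target_L:                               # termination #2
        # Splitting (eqs. 26-28): perturb each code by +/- Delta,
        # here with random b_k = +/-1.
        delta = eps * rng.choice((-1.0, 1.0), size=Z.shape)
        Z = np.vstack([Z + delta, Z - delta])              # L doubles
        Z, D = kmeans_from(X, Z)                           # run the inner loop
    return Z, D

# Example: a 32-code codebook for uniform unit-square data.
X = np.random.default_rng(0).random((2000, 2))
Z, D = lbg(X, target_L=32, seed=1)
print(len(Z), round(D, 2))
```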

Figure 7: The LBG vector quantization for some random 2D data, as $L$ equals 1, 2, 4, 8, 16 and 32.

Each inner loop usually converges in only a few steps. Consider, for example, Figure 8, which illustrates the convergence of the inner loop as two codes are split into four. Note that within two steps (labeled arrows) the four codes are already located very close to their final values.

If a randomized perturbation vector $\Delta$ is used, the LBG algorithm may end up with slightly different codes $Z$. Consider the two 32-prototype codebooks in Figure 9 below. These two VQ codebooks are generated for the same uniform data $X$ using the LBG algorithm, but with different randomized perturbation vectors $\Delta$. Note that even though the two codebooks are slightly different, their total distortion $D$ is almost the same.

Figure 8: The inner loop of the LBG VQ algorithm when two codes are split into four.

Figure 9: The two final codebooks, with final distortion D = 2.9 (left) and D = 2.2 (right).

C. Example #2

Here, we illustrate the LBG algorithm with another simple example. Let $X$ be a set of $n$ 2-dimensional vectors $\{x_j\}$, $j \in \{1, 2, \ldots, n\}$, distributed over the shaded region in Figure 10. Figure 10 illustrates the LBG VQ algorithm for the deterministic perturbation vector $\Delta = \{0.01, 0.01\}$. The codes for each $L$ from 1 to $2^5 = 32$ are those at the end of the inner loop of the LBG algorithm, and the lines in each plot delineate the regions of the 2-dimensional space that are part of cluster $\omega_i$.

Figure 10: The LBG vector quantization for some random 2D data, as $L$ equals 1, 2, 4, 8, 16 and 32.

D. Example #3: color-based object recognition

In this example, we are interested in differentiating between two different model cars as they race at high speeds along a toy race track, as shown in Figure 11 below. Because of the interlaced nature of the NTSC signal, and the high scaled speed of the cars, the actual images of the cars are quite noisy, and vary significantly depending on where the cars are located along the track. Figures 12 and 13, for example, show three examples each of cars #1 and #2 as they actually appear in the digitized images. Here, we will use vector quantization to model each car as a discrete probability distribution over pixel color values, in order to discriminate between the two cars. First, we record the RGB (red, green, blue) pixel values for a number of examples of each car; let us denote these as $X_1$ and $X_2$, respectively.

[Figure 11: the toy race track.]

[Figure 12: car #1 examples.]

[Figure 13: car #2 examples.]

[Figure 14: distribution of pixel values in RGB space for car #1 (light gray) and car #2 (dark gray) training data, and the corresponding vector codebook with 16 prototype vectors (computed with the LBG algorithm).]

Figure 14 plots these data sets in RGB space, where the light gray points correspond to $X_1$, and the dark gray points correspond to $X_2$. We now compute a 16-prototype vector codebook $Z$ using the LBG VQ algorithm for the joint data set $\{X_1, X_2\}$. The resulting VQ codebook, which is also plotted in Figure 14, is next used to quantize both data sets $X_1$ and $X_2$. We then count the frequencies of occurrence of each prototype vector $z_i$, $i \in \{1, 2, \ldots, 16\}$, in each data set, and normalize to fit probabilistic constraints. The resulting discrete probability models, $\lambda_1$ and $\lambda_2$, are plotted in Figure 15 below, and represent the discretized distribution of RGB color for each car.

If we now have an unknown car, represented by a collection of RGB pixel values $X = \{x_j\}$, $j \in \{1, 2, \ldots, n\}$, that we want to classify as being either car #1 or car #2, we can evaluate the probability of $X$ given each model $\lambda_1$ and $\lambda_2$,

$$P(X|\lambda_k) = \prod_{j=1}^{n} P(x_j|\lambda_k) = \prod_{j=1}^{n} P(l_j|\lambda_k), \quad k \in \{1, 2\}, \tag{31}$$

where $l_j$ corresponds to the VQ prototype vector label that is closest to $x_j$, such that,

$$\text{dist}(z_{l_j}, x_j) \le \text{dist}(z_i, x_j), \quad \forall i. \tag{32}$$

Of course, we will classify the unknown car as car #1 if $P(X|\lambda_1) > P(X|\lambda_2)$, and as car #2 otherwise. In Figure 16, for example, we plot $\hat{P}(X|\lambda_k)$, $k \in \{1, 2\}$, for the six car examples (three each) in Figures 12 and 13, where,

$$\hat{P}(X|\lambda_k) = e^{\log P(X|\lambda_k)/n}, \quad k \in \{1, 2\}. \tag{33}$$

In other words, $\hat{P}(X|\lambda_k)$ simply represents the probability $P(X|\lambda_k)$ normalized with respect to the number $n$ of RGB values in $X$. Note from Figure 16 that the VQ-based probability models in Figure 15 give us very good discrimination between the two cars.
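Below is a hedged sketch of this classification scheme, equations (31)-(33). The helper names, the smoothing floor (to avoid $\log 0$ for codes unseen in training), and the synthetic stand-in data are all our own additions; the original uses real digitized car pixels.

```python
import numpy as np

def quantize(X, Z):
    """Label each vector in X with the index of its nearest code in Z (eq. 32)."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def discrete_model(X, Z, floor=1e-6):
    """Normalized histogram of code labels; the floor is our smoothing addition."""
    counts = np.bincount(quantize(X, Z), minlength=len(Z)) + floor
    return counts / counts.sum()

def normalized_likelihood(X, Z, model):
    """P-hat(X | lambda_k) = exp((1/n) log P(X | lambda_k)), eqs. (31), (33)."""
    return np.exp(np.log(model[quantize(X, Z)]).mean())

# Toy usage with synthetic 'RGB' data and a random 16-code codebook.
rng = np.random.default_rng(0)
Z = rng.random((16, 3))
X1 = rng.random((500, 3))          # stand-in for car #1 training pixels
X2 = rng.random((500, 3)) ** 2     # stand-in for car #2 training pixels
lam1, lam2 = discrete_model(X1, Z), discrete_model(X2, Z)
X_unknown = rng.random((200, 3)) ** 2
print(normalized_likelihood(X_unknown, Z, lam1),
      normalized_likelihood(X_unknown, Z, lam2))   # the larger value wins
```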

[Figure 15: VQ-based, discrete probability models for car #1 (left) and car #2 (right).]

[Figure 16: $\hat{P}(X|\lambda_1)$ (light gray) and $\hat{P}(X|\lambda_2)$ (dark gray) for the car #1 examples in Figure 12 (left) and the car #2 examples in Figure 13 (right).]

[1] Y. Linde, A. Buzo and R. M. Gray, "An Algorithm for Vector Quantizer Design," IEEE Trans. on Communications, vol. COM-28, no. 1, pp. 84-95, 1980.

[2] X. D. Huang, Y. Ariki and M. A. Jack, Hidden Markov Models for Speech Recognition, Chapter 4, Edinburgh University Press, Edinburgh, 1990.