Vector Quantization: a Limiting Case of EM

1. Introduction & definitions

Assume that you are given a data set $X = \{x_j\}$, $j \in \{1, 2, \ldots, n\}$, of $d$-dimensional vectors. The vector quantization (VQ) problem requires that we find a set of prototype vectors $Z = \{z_i\}$, $i \in \{1, 2, \ldots, L\}$, $L \ll n$, such that the total distortion $D$,

$$D = \sum_{j=1}^{n} \min_i \,\text{dist}(x_j, z_i) \tag{1}$$

is minimized. In equation (1), $\text{dist}(x_j, z_i)$ is a distance metric given by either,

$$\text{dist}(x, z) = \|x - z\|^2 \quad \text{(Euclidean distance)}, \tag{2}$$

or, more generally,

$$\text{dist}(x, z) = (x - z)^T Q (x - z) \tag{3}$$

where $Q$ is a positive-definite, symmetric matrix, $Q > 0$. Weighting the distance along each dimension through the $Q$ matrix can normalize the distance measure in equation (3) with respect to different scaling along different dimensions of the $\{x_j\}$ vectors. The vectors $Z$ are known as the VQ codebook. Two applications of vector quantization are (1) redundant data compression, and (2) approximating continuous probability distributions with approximate discrete ones (i.e. histograms), where each $x_j$ is replaced with the label $i$ such that,

$$\text{dist}(x_j, z_i) \le \text{dist}(x_j, z_l), \quad \forall l. \tag{4}$$

We will see later that this is especially useful in hidden Markov modeling.

2. The k-means algorithm

A. Algorithm definition

The k-means algorithm is an algorithm for generating $Z$, the VQ codebook of prototype vectors. It is guaranteed to converge to a local minimum of $D$. The algorithm proceeds as follows:

1. Initialization: Choose some initial setting for the $L$ codes $\{z_i\}$ in the VQ codebook. One way to do this is to initialize the $\{z_i\}$ to some random subset of $L$ vectors in $X$.

2. Classification: Classify each $x_j$ into cluster (or class) $\omega_i$ such that,

$$\text{dist}(x_j, z_i) \le \text{dist}(x_j, z_l), \quad \forall l. \tag{5}$$

3. Codebook update: Update the code for every cluster $\omega_i$ by computing its centroid,

$$z_i = \frac{1}{n_i} \sum_{x_j \in \omega_i} x_j \tag{6}$$

where $n_i$ is the number of vectors $x_j$ in cluster $\omega_i$.

4. Termination: Stop when the distortion $D$ has decreased below some threshold level, or when the algorithm has converged to a constant level of distortion; otherwise, loop back to step 2.
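Steps 1-4 translate directly into code. Below is a minimal NumPy sketch of the algorithm as listed above; the function and variable names (`kmeans`, `Z`, `labels`) are our own, and the squared-Euclidean distance of equation (2) is assumed.

```python
import numpy as np

def kmeans(X, L, tol=1e-6, max_iter=100, seed=None):
    """Return codebook Z (L x d), labels and total distortion D for data X (n x d)."""
    rng = np.random.default_rng(seed)
    # 1. Initialization: a random subset of L vectors in X.
    Z = X[rng.choice(len(X), size=L, replace=False)].astype(float)
    prev_D = np.inf
    for _ in range(max_iter):
        # 2. Classification (eq. 5), with squared Euclidean distance (eq. 2).
        d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)   # (n, L)
        labels = d2.argmin(axis=1)
        D = d2.min(axis=1).sum()                                  # eq. (1)
        # 3. Codebook update: move each code to its cluster centroid (eq. 6).
        for i in range(L):
            members = X[labels == i]
            if len(members) > 0:       # leave an empty cluster's code untouched
                Z[i] = members.mean(axis=0)
        # 4. Termination: distortion no longer decreasing (to within tol).
        if prev_D - D < tol:
            break
        prev_D = D
    return Z, labels, D

# Usage on uniform 2-D data: two codes in the unit square.
X = np.random.default_rng(0).random((2000, 2))
Z, labels, D = kmeans(X, L=2, seed=1)
print(np.round(Z, 2))
```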

B. Example #1

Here, we investigate the convergence properties of the k-means algorithm with a simple example. Let $X$ be a set of $n$ 2-dimensional vectors $\{x_j\}$, $j \in \{1, 2, \ldots, n\}$, distributed uniformly in the unit square,

$$0 \le x_1, x_2 \le 1 \tag{7}$$

as shown in Figure 1 below.

[Figure 1: the data set $X$, distributed uniformly over the unit square.]

Assuming infinite data, the distortion for two codes $\{z_1, z_2\}$ is given by,

$$D(z_1, z_2) = \int_{A_1} \text{dist}(x, z_1)\, dA + \int_{A_2} \text{dist}(x, z_2)\, dA \tag{8}$$

where $A_i$ denotes the area in the unit square that is part of cluster $\omega_i$, $i \in \{1, 2\}$. Therefore, the globally optimal solution $\{z_1^*, z_2^*\}$, in other words, the minimum-distortion solution for two codes, is given by,

$$\{z_1^*, z_2^*\} = \arg\min_{\{z_1, z_2\}} [D(z_1, z_2)] \tag{9}$$

Denoting,

$$z_1 = \{z_{11}, z_{12}\} \quad \text{and} \quad z_2 = \{z_{21}, z_{22}\}, \tag{10}$$

it appears that $D(z_1, z_2)$ must be optimized over four independent scalars: $\{z_{11}, z_{12}, z_{21}, z_{22}\}$. Since the data is distributed symmetrically about $(1/2, 1/2)$, the optimal prototype vectors $\{z_1^*, z_2^*\}$ are, however, constrained by,

$$z_{21} = 1 - z_{11} \tag{11}$$

$$z_{22} = 1 - z_{12} \tag{12}$$

Therefore, we can explicitly plot $D(z_1, \bar{z}_1)$, where $\bar{z}_1 = \{1 - z_{11}, 1 - z_{12}\}$, as a function of $\{z_{11}, z_{12}\}$, as shown in Figure 2 below. In Figure 2, red shades indicate the smallest distortions, and we see that there are four globally optimal solutions $\{z_1^*, z_2^*\}$:

$$\{z_1^*, z_2^*\} = \{(1/2, 1/4), (1/2, 3/4)\} = \{(1/2, 3/4), (1/2, 1/4)\} \tag{13}$$

$$\{z_1^*, z_2^*\} = \{(1/4, 1/2), (3/4, 1/2)\} = \{(3/4, 1/2), (1/4, 1/2)\} \tag{14}$$
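As a quick sanity check on equations (8) and (13)-(14), the snippet below (our own illustration, not from the original notes) estimates the two-code distortion by Monte Carlo and confirms that the mid-line solutions beat an alternative diagonal placement of the codes:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100_000, 2))          # stand-in for the 'infinite data' case

def mean_distortion(X, Z):
    """Monte Carlo estimate of eq. (8) per unit area, squared distance."""
    d2 = ((X[:, None, :] - np.asarray(Z)[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).mean()

print(mean_distortion(X, [(0.5, 0.25), (0.5, 0.75)]))    # optimal: ~0.104 (= 5/48)
print(mean_distortion(X, [(0.25, 0.25), (0.75, 0.75)]))  # diagonal: ~0.125, worse
```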

[Figure 2: distortion $D$ as a function of $\{z_{11}, z_{12}\}$; red shades indicate the smallest distortions.]

Note that the four solutions are the same, except for a switch in the two axes, as well as a switch in the prototype vector labels. Now that we know what the theoretical minimum-distortion two-code solutions are, we conduct the following experiment. We run the k-means algorithm for initial random prototypes in the interval,

$$(0, 0) \le z_i \le (1, 1), \quad i \in \{1, 2\} \tag{15}$$

and observe the values of $\{z_1, z_2\}$ to which the algorithm converges. Figure 3 below plots the results of the trials. Note that all trials converge to approximately the optimal solutions in (13) and (14). The decision boundary between class $\omega_1$ and $\omega_2$ for each solution pair $\{z_1, z_2\}$ is given by either,

$$x_1 = 0.5 \quad \text{or} \quad x_2 = 0.5. \tag{16}$$

[Figure 3: converged code locations for the random-initialization trials.]

C. Example #2

Here, we investigate the convergence properties of the k-means algorithm for the uniform distribution $X$ of points in the annular region shown in Figure 4 below, where the inside and outside radii are given by $r_1 = 0.12$ and $r_2 = 0.35$, respectively. Since the distribution is radially symmetric about the point $(1/2, 1/2)$, the locus of globally optimal (minimum-distortion) 2-code solutions is necessarily described by the circle,

$$(x_1 - 1/2)^2 + (x_2 - 1/2)^2 = r^{*2} \tag{17}$$

Once again assuming infinite data, we can compute the globally optimal value of $r^*$ by recognizing that the two classes $\omega_1$ and $\omega_2$ can be described by the solid and dashed lines indicated in Figure 4.

[Figure 4: the annular data region, with one possible two-class partition indicated by solid and dashed lines.]

Note that the delineated regions are only one possible description of $\omega_1$ and $\omega_2$; the regions can, of course, be rotated by an angle $\theta$, $0 \le \theta \le 2\pi$, without loss of optimality. To compute $r^*$ we now simply have to compute the centroid of the solid-line region, let's call this region $A$ in the above figure, so that,

$$r^* = \frac{\int_A x_2 \, dA}{\int_A dA} - \frac{1}{2} \tag{18}$$

$$r^* = \frac{1789}{3525\pi} \approx 0.1615 \tag{19}$$

Thus, the set of globally optimal solutions for the codes $\{z_1, z_2\}$ is given by,

$$\{z_1^*, z_2^*\} = \{(1/2 + r^*\cos\theta,\ 1/2 + r^*\sin\theta),\ (1/2 + r^*\cos[\theta + \pi],\ 1/2 + r^*\sin[\theta + \pi])\}, \quad \forall\theta. \tag{20}$$

We now conduct the following experiment. We run five trials of the k-means algorithm with initial random codes in the interval,

$$(0, 0) \le z_i \le (1, 1), \quad i \in \{1, 2\} \tag{21}$$

and observe the values of $\{z_1, z_2\}$ to which the algorithm converges. The figure below plots the results of the five trials. Note that all five trials converge to approximately the optimal solution locus in (20).

[Figure 5: converged code locations for the five trials, relative to the optimal circle of equation (20).]
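Equation (19) can be checked numerically. For this geometry, equation (18) reduces to the standard closed form for the centroid of a half-annulus; the following snippet (our own verification, not part of the original notes) evaluates it and compares against the exact value:

```python
import numpy as np

r1, r2 = 0.12, 0.35
# Centroid distance of a half-annulus from its center (standard result):
#   r* = (4 / (3*pi)) * (r2^3 - r1^3) / (r2^2 - r1^2)
r_star = 4 * (r2**3 - r1**3) / (3 * np.pi * (r2**2 - r1**2))
print(r_star)                   # 0.16154...
print(1789 / (3525 * np.pi))    # identical: the exact value in eq. (19)
```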

D. Convergence

In the previous two examples, we showed that the k-means algorithm converges to good near-optimal solutions for some simple, idealized cases. It turns out that this convergence is, in fact, guaranteed, since the k-means algorithm is simply a limiting case of the EM algorithm for estimating the parameters of a mixture of Gaussians. Recall that for the EM algorithm, the reestimation of the means $\mu_i$ (identical to the $z_i$ here) is given by,

$$\mu_i = \frac{\sum_{j=1}^{n} P(\omega_i | x_j)\, x_j}{\sum_{j=1}^{n} P(\omega_i | x_j)} \tag{22}$$

Assuming equal priors $P(\omega_i)$ and equal variances $\sigma_i^2 = \sigma^2$, it can easily be shown that,

$$\lim_{\sigma^2 \to 0} \mu_i = \lim_{\sigma^2 \to 0} \frac{\sum_{j=1}^{n} P(\omega_i | x_j)\, x_j}{\sum_{j=1}^{n} P(\omega_i | x_j)} = \frac{1}{n_i} \sum_{x_j \in \omega_i} x_j = z_i \tag{23}$$

Intuitively, as $\sigma^2 \to 0$,

$$p(x_j | \mu_i) \gg p(x_j | \mu_l), \quad i \ne l, \tag{24}$$

where,

$$D(x_j, \mu_i) < D(x_j, \mu_l) \tag{25}$$

so that the likelihoods $p(x_j | \mu_l)$, $i \ne l$, become insignificantly small compared to $p(x_j | \mu_i)$. Figure 6, for example, illustrates the convergence trajectories of the VQ and EM algorithms for one of the trials in example #1. Note that the VQ and EM trajectories appear very similar for $\sigma = 0.01$. Since the k-means VQ algorithm is a limiting case of the EM algorithm, its convergence is also guaranteed. And, because the VQ reestimation equations are much faster to compute than the EM reestimation equations, the VQ algorithm is sometimes preferred in practice.

[Figure 6: Example #1 VQ convergence (left); example #1 EM convergence for $\sigma = 0.1$ and $\sigma = 0.01$ (right).]

One potential problem in VQ algorithms is that during convergence, one or more clusters (or classes) $\omega_i$ might become empty. A typical solution to this problem splits the cluster which currently exhibits the largest distortion in two, and reassigns the empty class to part of the large-distortion cluster.
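The limiting behavior in equations (22)-(23) is easy to observe numerically. The toy example below is our own construction (1-D data, two fixed means, equal priors and variances); it shows the EM responsibilities $P(\omega_i | x_j)$ hardening into the 0/1 assignments of k-means as $\sigma$ shrinks:

```python
import numpy as np

x = np.array([0.1, 0.2, 0.8, 0.9])     # toy data points
mu = np.array([0.0, 1.0])              # two means (the z_i)
for sigma in (0.5, 0.1, 0.01):
    # Posteriors P(omega_i | x_j) for equal priors and equal variances.
    log_p = -0.5 * ((x[:, None] - mu[None, :]) / sigma) ** 2
    resp = np.exp(log_p - log_p.max(axis=1, keepdims=True))
    resp /= resp.sum(axis=1, keepdims=True)
    # EM mean reestimation, eq. (22).
    mu_new = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
    print(f"sigma={sigma}:\n{np.round(resp, 3)}  mu_new={np.round(mu_new, 3)}")
# As sigma -> 0 the rows of resp become one-hot, and eq. (22) reduces to the
# hard k-means centroid update of eq. (6), which is the content of eq. (23).
```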

3. The LBG VQ algorithm

A. Algorithm description

The well-known LBG vector quantization (VQ) algorithm was proposed by Linde, Buzo and Gray [1] in 1980. It addresses the problem of VQ codebook initialization by iteratively generating codebooks $\{z_i\}$, $i \in \{1, \ldots, 2^m\}$, $m \in \{0, 1, 2, \ldots\}$, of increasing size. The algorithm proceeds as outlined below. Note that the inner loop in the LBG VQ algorithm is equivalent to the k-means algorithm for the current value of $L$; a code sketch of the full procedure follows the setup of Example #1 below.

1. Initialization: Set $L = 1$, where $L$ is the number of VQ codes in $Z$, and let $z_1$ be the centroid (e.g. mean) of the data set $X$.

2. Splitting: Split each VQ code $z_i$ into two codes $\{z_i, z_{i+L}\}$,

$$z_{i+L} = z_i - \Delta \quad \text{and} \quad z_i = z_i + \Delta, \tag{26}$$

where,

$$\Delta = \varepsilon \{b_1, b_2, \ldots, b_d\}, \tag{27}$$

and $\varepsilon$ is some small number, typically 0.01. The $b_k$ can be set to all 1s or to a random value of $\pm 1$. Since the number of VQ codes in $Z$ has been doubled, let,

$$L = 2L. \tag{28}$$

Inner loop:

1. Classification: Classify each $x_j$ into cluster (or class) $\omega_i$ such that,

$$\text{dist}(x_j, z_i) \le \text{dist}(x_j, z_l), \quad \forall l. \tag{29}$$

2. Codebook update: Update the code for every cluster $\omega_i$ by computing its centroid,

$$z_i = \frac{1}{n_i} \sum_{x_j \in \omega_i} x_j \tag{30}$$

where $n_i$ is the number of vectors $x_j$ in cluster $\omega_i$.

3. Termination #1: Stop the inner loop when the distortion $D$ has decreased below some threshold level, or when the algorithm has converged to a constant level of distortion.

3. Termination #2: Stop when $L$ is the desired VQ codebook size; otherwise, return to the splitting step.

There are two main advantages of the LBG VQ algorithm over the standard k-means algorithm. First, the algorithm is self-starting, in the sense that problem-specific initialization is not required. Second, the LBG VQ algorithm automatically generates codebooks of size $2^m$, $m \in \{0, 1, 2, \ldots\}$. This can be useful when we do not know a priori how large the VQ codebook needs to be for a specific application with a required maximum level of distortion. Presently, the LBG VQ algorithm is probably the most frequently used VQ algorithm across a number of different applications.

B. Example #1

Here, we illustrate the LBG algorithm with a simple example. Let $X$ be a set of $n$ 2-dimensional vectors $\{x_j\}$, $j \in \{1, 2, \ldots, n\}$, distributed uniformly in the unit square. Figure 7 illustrates the LBG VQ algorithm for the deterministic perturbation vector $\Delta = \{0.01, 0.01\}$. The codes for each $L$ from 1 to $2^5 = 32$ are those at the end of the inner loop of the LBG algorithm, and the lines in each plot delineate the regions of the 2-dimensional space that are part of cluster $\omega_i$.
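Here is the promised minimal sketch of the LBG procedure of section A above, with the k-means inner loop factored out. This is our own illustrative code, not from [1]; the random $b_k$ choice, $\varepsilon$, and the tolerance-based stopping rule are the assumptions noted in the listing.

```python
import numpy as np

def kmeans_from(X, Z, tol=1e-6, max_iter=100):
    """Inner loop: plain k-means (eqs. 29-30) started from a given codebook Z."""
    Z = Z.copy()
    prev_D = np.inf
    for _ in range(max_iter):
        d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        D = d2.min(axis=1).sum()
        for i in range(len(Z)):
            members = X[labels == i]
            if len(members) > 0:
                Z[i] = members.mean(axis=0)
        if prev_D - D < tol:               # termination #1
            break
        prev_D = D
    return Z, D

def lbg(X, target_L, eps=0.01, seed=None):
    """Outer loop: grow the codebook by splitting, L = 1, 2, 4, ..., target_L."""
    rng = np.random.default_rng(seed)
    Z, D = kmeans_from(X, X.mean(axis=0, keepdims=True))   # step 1: L = 1
    while len(Z) < target_L:                               # termination #2
        # Splitting (eqs. 26-28): perturb each code by +/- Delta,
        # here with random b_k = +/-1.
        delta = eps * rng.choice((-1.0, 1.0), size=Z.shape)
        Z = np.vstack([Z + delta, Z - delta])              # L doubles
        Z, D = kmeans_from(X, Z)                           # run the inner loop
    return Z, D

# Example: a 32-code codebook for uniform unit-square data.
X = np.random.default_rng(0).random((2000, 2))
Z, D = lbg(X, target_L=32, seed=1)
print(len(Z), round(D, 2))
```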

Figure 7: The LBG vector quantization for some random 2D data, as $L$ equals 1, 2, 4, 8, 16 and 32.

Each inner loop usually converges in only a few steps. Consider, for example, Figure 8, which illustrates the convergence of the inner loop as two codes are split into four. Note that within two steps (labeled arrows) the four codes are already located very close to their final values.

If a randomized perturbation vector $\Delta$ is used, the LBG algorithm may end up with slightly different codes $Z$. Consider the two 32-prototype codebooks in Figure 9 below. These two VQ codebooks are generated for the same uniform data $X$ using the LBG algorithm, but with different randomized perturbation vectors $\Delta$. Note that even though the two codebooks are slightly different, their total distortion $D$ is almost the same.

Figure 8: The inner loop of the LBG VQ algorithm when two codes are split into four.

Figure 9: The two final codebooks, with final distortion D = 2.9 (left) and D = 2.2 (right).

C. Example #2

Here, we illustrate the LBG algorithm with another simple example. Let $X$ be a set of $n$ 2-dimensional vectors $\{x_j\}$, $j \in \{1, 2, \ldots, n\}$, distributed over the shaded region in Figure 10. Figure 10 illustrates the LBG VQ algorithm for the deterministic perturbation vector $\Delta = \{0.01, 0.01\}$. The codes for each $L$ from 1 to $2^5 = 32$ are those at the end of the inner loop of the LBG algorithm, and the lines in each plot delineate the regions of the 2-dimensional space that are part of cluster $\omega_i$.

Figure 10: The LBG vector quantization for some random 2D data, as $L$ equals 1, 2, 4, 8, 16 and 32.

D. Example #3: color-based object recognition

In this example, we are interested in differentiating between two different model cars as they race at high speeds along a toy race track, as shown in Figure 11 below. Because of the interlaced nature of the NTSC signal, and the high scaled speed of the cars, the actual images of the cars are quite noisy, and vary significantly depending on where the cars are located along the track. Figures 12 and 13, for example, show three examples each of cars #1 and #2 as they actually appear in the digitized images. Here, we will use vector quantization to model each car as a discrete probability distribution over pixel color values, in order to discriminate between the two cars. First, we record the RGB (red, green, blue) pixel values for a number of examples of each car; let us denote these as $X_1$ and $X_2$, respectively.

[Figure 11: the toy race track.]

[Figure 12: car #1 examples.]

[Figure 13: car #2 examples.]

[Figure 14: distribution of pixel values in RGB space for car #1 (light gray) and car #2 (dark gray) training data, and the corresponding vector codebook with 16 prototype vectors (computed with the LBG algorithm).]

Figure 14 plots these data sets in RGB space, where the light gray points correspond to $X_1$, and the dark gray points correspond to $X_2$. We now compute a 16-prototype vector codebook $Z$ using the LBG VQ algorithm for the joint data set $\{X_1, X_2\}$. The resulting VQ codebook, which is also plotted in Figure 14, is next used to quantize both data sets $X_1$ and $X_2$. We then count the frequencies of occurrence of each prototype vector $z_i$, $i \in \{1, 2, \ldots, 16\}$, in each data set, and normalize to fit probabilistic constraints. The resulting discrete probability models, $\lambda_1$ and $\lambda_2$, are plotted in Figure 15 below, and represent the discretized distribution of RGB color for each car.

If we now have an unknown car, represented by a collection of RGB pixel values $X = \{x_j\}$, $j \in \{1, 2, \ldots, n\}$, that we want to classify as being either car #1 or car #2, we can evaluate the probability of $X$ given each model $\lambda_1$ and $\lambda_2$,

$$P(X|\lambda_k) = \prod_{j=1}^{n} P(x_j|\lambda_k) = \prod_{j=1}^{n} P(l_j|\lambda_k), \quad k \in \{1, 2\}, \tag{31}$$

where $l_j$ corresponds to the VQ prototype vector label that is closest to $x_j$, such that,

$$\text{dist}(z_{l_j}, x_j) \le \text{dist}(z_i, x_j), \quad \forall i. \tag{32}$$

Of course, we will classify the unknown car as car #1 if $P(X|\lambda_1) > P(X|\lambda_2)$, and as car #2 otherwise. In Figure 16, for example, we plot $\hat{P}(X|\lambda_k)$, $k \in \{1, 2\}$, for the six car examples (three each) in Figures 12 and 13, where,

$$\hat{P}(X|\lambda_k) = e^{\log P(X|\lambda_k)/n}, \quad k \in \{1, 2\}. \tag{33}$$

In other words, $\hat{P}(X|\lambda_k)$ simply represents the probability $P(X|\lambda_k)$ normalized with respect to the number $n$ of RGB values in $X$. Note from Figure 16 that the VQ-based probability models in Figure 15 give us very good discrimination between the two cars.
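Below is a hedged sketch of this classification scheme, equations (31)-(33). The helper names, the smoothing floor (to avoid $\log 0$ for codes unseen in training), and the synthetic stand-in data are all our own additions; the original uses real digitized car pixels.

```python
import numpy as np

def quantize(X, Z):
    """Label each vector in X with the index of its nearest code in Z (eq. 32)."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def discrete_model(X, Z, floor=1e-6):
    """Normalized histogram of code labels; the floor is our smoothing addition."""
    counts = np.bincount(quantize(X, Z), minlength=len(Z)) + floor
    return counts / counts.sum()

def normalized_likelihood(X, Z, model):
    """P-hat(X | lambda_k) = exp((1/n) log P(X | lambda_k)), eqs. (31), (33)."""
    return np.exp(np.log(model[quantize(X, Z)]).mean())

# Toy usage with synthetic 'RGB' data and a random 16-code codebook.
rng = np.random.default_rng(0)
Z = rng.random((16, 3))
X1 = rng.random((500, 3))          # stand-in for car #1 training pixels
X2 = rng.random((500, 3)) ** 2     # stand-in for car #2 training pixels
lam1, lam2 = discrete_model(X1, Z), discrete_model(X2, Z)
X_unknown = rng.random((200, 3)) ** 2
print(normalized_likelihood(X_unknown, Z, lam1),
      normalized_likelihood(X_unknown, Z, lam2))   # the larger value wins
```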

[Figure 15: VQ-based, discrete probability models for car #1 (left) and car #2 (right).]

[Figure 16: $\hat{P}(X|\lambda_1)$ (light gray) and $\hat{P}(X|\lambda_2)$ (dark gray) for the car #1 examples in Figure 12 (left) and the car #2 examples in Figure 13 (right).]

[1] Y. Linde, A. Buzo and R. M. Gray, "An Algorithm for Vector Quantizer Design," IEEE Trans. on Communications, vol. COM-28, no. 1, pp. 84-95, 1980.

[2] X. D. Huang, Y. Ariki and M. A. Jack, Hidden Markov Models for Speech Recognition, Chapter 4, Edinburgh University Press, Edinburgh, 1990.