Supplementary Material: A Single-Pass Algorithm for Efficiently Recovering Sparse Cluster Centers of High-dimensional Data


Jinfeng Yi (JINFENGY@US.IBM.COM), IBM Thomas J. Watson Research Center, Yorktown Heights, NY 10598, USA
Lijun Zhang (ZHANGLJ@LAMDA.NJU.EDU.CN), National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China
Jun Wang (WANGJUN@US.IBM.COM), IBM Thomas J. Watson Research Center, Yorktown Heights, NY 10598, USA
Rong Jin (RONGJIN@CSE.MSU.EDU) and Anil K. Jain (JAIN@CSE.MSU.EDU), Department of Computer Science and Engineering, Michigan State University, East Lansing, MI 48824, USA

Theorem 1. Let $\delta \le 1/(6m)$ be a parameter to control the success probability. Assume

$$\Delta_1 \le \frac{\rho}{2}, \qquad \lambda \le \frac{\Delta_1}{2\sqrt{s}}, \qquad (2)$$

$$T \ge \frac{18}{\mu_0} \max\left( \ln\frac{K}{\delta},\ \frac{3 c_2^2 \eta_0^2}{\lambda^2},\ \frac{6 c_3^2 \sigma^2}{\lambda^2} (\ln n + \ln d) \right), \qquad (3)$$

where $c_1$, $c_2$ and $c_3$ are some universal constants. Then, with a probability at least $1 - 6m\delta$, we have

$$\Delta_{m+1} = \max_{1 \le i \le K} \left\| \hat{c}^{\,m+1}_i - c_i \right\|_2 \le \max\left( \frac{\Delta_1}{2^m},\ 2 c_1 \sqrt{s}\, \lambda \right).$$

Corollary 2. The convergence rate for $\Delta$, the maximum difference between the optimal cluster centers and the estimated ones, is $O(\sqrt{s \log d / n})$ before reaching the optimal difference.

1. Proof of Corollary 2

According to the assumption on $\lambda$ in (2), $\lambda$ can be chosen as small as the constraints permit. Since the value of $T$ is dominated by the last term on the right-hand side of (3), we have $T = \Theta\big(\sigma^2 (\ln n + \ln d) / (\mu_0 \lambda^2)\big)$, which implies

$$\sqrt{s}\, \lambda = \Theta\left( \sigma \sqrt{\frac{s (\ln n + \ln d)}{\mu_0 T}} \right) = O\left( \sqrt{\frac{s \log d}{T}} \right).$$

Combining this with the conclusion $\Delta_{m+1} \le \max(\Delta_1 / 2^m,\ 2 c_1 \sqrt{s}\, \lambda)$ of Theorem 1, and noting that the $n$ streamed data points are split over the $m$ iterations so that $T = \Theta(n/m)$, we have, once $\Delta_1 / 2^m$ drops below the second term,

$$\Delta_{m+1} = O\left( \sqrt{\frac{m\, s \log d}{n}} \right) = \tilde{O}\left( \sqrt{\frac{s \log d}{n}} \right).$$
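To make the statement of Theorem 1 and Corollary 2 concrete, the following minimal sketch (not from the paper) iterates the error recursion $\Delta_{t+1} = \max(\Delta_t/2,\ 2c_1\sqrt{s}\,\lambda)$ implied by Theorem 1. The constant $c_1 = 1$, the choice $m \approx \log_2 n$, and the setting of $\lambda$ from the last term of (3) are all illustrative assumptions.

```python
# Sketch: iterate Delta_{t+1} = max(Delta_t / 2, 2 * c1 * sqrt(s) * lam) and
# report the floor 2 * c1 * sqrt(s) * lam = O(sqrt(s * log(d) / n)).
# c1, m, and lam below are illustrative assumptions, not the paper's constants.
import math

def error_floor(s, d, n, sigma=1.0, c1=1.0):
    """lam chosen so the last term of (3) is tight with T = n / m points per pass."""
    m = max(1, int(math.log2(n)))      # assume m ~ log2(n) iterations
    T = n / m                          # stream budget per iteration
    lam = sigma * math.sqrt((math.log(n) + math.log(d)) / T)
    return 2 * c1 * math.sqrt(s) * lam

def iterate(delta1, s, d, n, iters=30):
    floor = error_floor(s, d, n)
    delta = delta1
    for _ in range(iters):
        delta = max(delta / 2, floor)  # halve until the floor is reached
    return delta, floor

delta, floor = iterate(delta1=1.0, s=10, d=10_000, n=1_000_000)
print(f"final error {delta:.4f}; floor 2*sqrt(s)*lam = {floor:.4f}")
```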

Lemma 1. Let $\Delta_t$ be the maximum difference between the optimal cluster centers and the ones estimated at iteration $t$, and let $\delta \in (0, 1)$ be the failure probability. Assume

$$\Delta_t \le \frac{\rho}{2} - \sigma \sqrt{5 \ln\frac{3K}{\delta}}, \qquad |S_t| \ge \frac{18}{\mu_0} \ln\frac{K}{\delta}, \qquad (4)$$

$$\lambda_t \ge c_1 \exp\left( -\frac{(\rho - 2\Delta_t)^2}{8 (1 + \Delta_t)^2 \sigma^2} \right) \left( \eta_0 + \sigma \sqrt{2 \ln |S_t|} \right) \qquad (5)$$

$$\phantom{\lambda_t \ge}\ + c_2\, \eta_0 \sqrt{\frac{\ln(K/\delta)}{|S_t|}} + c_3\, \sigma \sqrt{\frac{\ln(K/\delta) + \ln d}{|S_t|}}, \qquad (6)$$

for some constants $c_1$, $c_2$ and $c_3$. Then, with a probability at least $1 - 6\delta$, we have

$$\Delta_{t+1} \le 2 \sqrt{s}\, \lambda_t.$$

2. Proof of Lemma 1

For simplicity of the analysis, we drop the superscript $t$ throughout.

2.1. Preliminaries

We denote by $C_k$ the support of $c_k$ and by $\bar{C}_k = [d] \setminus C_k$ its complement. For any vector $z$, $z(C)$ is defined as $[z(C)]_i = z_i$ if $i \in C$ and zero otherwise. For any $x_i \in S$, we use $k_i$ to denote the index of the true cluster, and $\hat{k}_i$ to denote the index of the cluster assigned by the nearest neighbor search, i.e.,

$$x_i = c_{k_i} + g_i \ \text{ with } g_i \sim \mathcal{N}(0, \sigma^2 I), \qquad \hat{k}_i = \arg\max_{j \in [K]} \hat{c}_j^\top x_i.$$

Then we can partition the data points in $S$ based on either the ground truth or the assigned cluster. Let $S_k$ be the subset of data points in $S$ that belong to the $k$-th cluster, i.e.,

$$S_k = \left\{ x_i \in S : x_i = c_k + g_i,\ g_i \sim \mathcal{N}(0, \sigma^2 I) \right\}. \qquad (7)$$

Let $\hat{S}_k$ be the subset of data points that are assigned to the $k$-th cluster by the nearest neighbor search, i.e.,

$$\hat{S}_k = \left\{ x_i \in S : k = \arg\max_{j \in [K]} \hat{c}_j^\top x_i \right\}. \qquad (8)$$

2.2. The Main Analysis

Let $L_k(c)$ be the objective function in Step 2 of Algorithm 1. We expand $L_k(c)$ as

$$L_k(c) = \lambda \|c\|_1 + \frac{1}{|\hat{S}_k|} \sum_{x_i \in \hat{S}_k} \|x_i - c\|_2^2 = \lambda \|c\|_1 + \|c - c_k\|_2^2 - (c - c_k)^\top A_k - (c - c_k)^\top B_k + \text{const}, \qquad (9)$$

where

$$A_k = \frac{2}{|\hat{S}_k|} \sum_{x_i \in \hat{S}_k \setminus S_k} (c_{k_i} - c_k), \qquad B_k = \frac{2}{|\hat{S}_k|} \sum_{x_i \in \hat{S}_k} g_i.$$
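Before analyzing the minimizer of $L_k(c)$, here is a minimal numerical sketch of the center update itself. Assuming, as in (9), that Step 2 of Algorithm 1 solves $\min_c\ \lambda\|c\|_1 + \frac{1}{|\hat{S}_k|}\sum_{x_i\in\hat{S}_k}\|x_i - c\|_2^2$, setting the subgradient to zero gives coordinate-wise soft-thresholding of the cluster mean at level $\lambda/2$; the dimensions and the value of $\lambda$ below are illustrative.

```python
# Sketch of the sparse center update analyzed in Section 2.2: the minimizer of
# lam * ||c||_1 + (1/n_k) * sum_i ||x_i - c||^2 is the soft-thresholded mean
# (threshold lam / 2 under this scaling of the quadratic term).
import numpy as np

def sparse_center(X, lam):
    """X: (n_k, d) points assigned to one cluster; returns the sparse center."""
    mean = X.mean(axis=0)
    return np.sign(mean) * np.maximum(np.abs(mean) - lam / 2, 0.0)

rng = np.random.default_rng(0)
d, s, n_k, sigma = 1000, 10, 200, 0.5
c_true = np.zeros(d)
c_true[:s] = 1.0                                      # s-sparse true center
X = c_true + sigma * rng.standard_normal((n_k, d))    # x_i = c_k + g_i
lam = 2 * sigma * np.sqrt(np.log(d) / n_k)            # illustrative lambda
print("recovery error:", np.linalg.norm(sparse_center(X, lam) - c_true))
```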

Let $c_k^*$ be the optimal solution that minimizes $L_k(c)$, and define $f_k = c_k^* - c_k$. We have

$$0 \ge L_k(c_k^*) - L_k(c_k) = \lambda \|c_k + f_k\|_1 - \lambda \|c_k\|_1 + \|f_k\|_2^2 - f_k^\top A_k - f_k^\top B_k$$

$$\ge \lambda \|f_k(\bar{C}_k)\|_1 - \lambda \|f_k(C_k)\|_1 + \|f_k\|_2^2 - f_k^\top A_k - f_k^\top B_k$$

$$\ge \left( \lambda - \|A_k\|_\infty - \|B_k\|_\infty \right) \|f_k(\bar{C}_k)\|_1 - \left( \lambda + \|A_k\|_\infty + \|B_k\|_\infty \right) \|f_k(C_k)\|_1 + \|f_k\|_2^2.$$

Thus, if we have $\lambda \ge 2(\|A_k\|_\infty + \|B_k\|_\infty)$, then

$$\|f_k\|_2^2 \le \left( \lambda + \|A_k\|_\infty + \|B_k\|_\infty \right) \|f_k(C_k)\|_1 \le \frac{3\lambda}{2} \sqrt{|C_k|}\, \|f_k\|_2 \le 2 \lambda \sqrt{|C_k|}\, \|f_k\|_2,$$

and thus $\|f_k\|_2 \le 2 \lambda \sqrt{|C_k|} \le 2 \sqrt{s}\, \lambda$. In summary, if we have $\lambda \ge 2(\|A_k\|_\infty + \|B_k\|_\infty)$ for all $k \in [K]$, then

$$\max_{k \in [K]} \|c_k^* - c_k\|_2 \le 2 \sqrt{s}\, \lambda.$$

In the following, we discuss how to bound $A_k$ and $B_k$.

2.3. Bound for $A_k$

From the definition of $A_k$ in (9) and the fact that $\|c_{k_i} - c_k\|_2 \le \eta_0$, we have

$$\|A_k\|_\infty \le \|A_k\|_2 \le \frac{2 \eta_0\, |\hat{S}_k \setminus S_k|}{|\hat{S}_k|}.$$

2.3.1. LOWER BOUND OF $|\hat{S}_k \cap S_k|$

First, we show that the size of $S_k$ is lower-bounded, i.e., a significant fraction of the data points in $S$ belong to the $k$-th cluster. Recall that $\mu_1, \ldots, \mu_K$ are the weights of the Gaussian mixture, and $\mu_0 = \min_{1 \le i \le K} \mu_i$. According to the Chernoff bound (Angluin & Valiant, 1979) provided in Appendix A, we have, with a probability at least $1 - \delta$,

$$|S_k| \ge |S| \mu_k - \sqrt{2 |S| \mu_k \ln\frac{K}{\delta}} \ge \frac{2}{3} |S| \mu_k, \quad k \in [K], \qquad (10)$$

where the last step uses the lower bound on $|S|$ in (4).
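The lower bound (10) is easy to check by simulation. The sketch below (the number of clusters, weights, and sample sizes are illustrative choices) draws multinomial cluster sizes and estimates how often $|S_k|$ falls below $|S|\mu_k - \sqrt{2|S|\mu_k\ln(K/\delta)}$; at this sample size the threshold itself sits above $\frac{2}{3}|S|\mu_k$, as required by (4).

```python
# Monte-Carlo check of (10): cluster sizes |S_k| drawn from a multinomial rarely
# fall below |S| mu_k - sqrt(2 |S| mu_k ln(K / delta)). Parameters are illustrative.
import numpy as np

rng = np.random.default_rng(1)
K, n, delta, trials = 4, 2000, 0.05, 10_000
mu = np.full(K, 1.0 / K)                               # equal mixture weights
counts = rng.multinomial(n, mu, size=trials)           # rows of cluster sizes
threshold = n * mu[0] - np.sqrt(2 * n * mu[0] * np.log(K / delta))
print(f"empirical failure rate {np.mean(counts[:, 0] < threshold):.4f}"
      f" vs delta/K = {delta / K:.4f}")
print(f"(2/3)|S|mu_k = {2 * n * mu[0] / 3:.1f} <= threshold = {threshold:.1f}")
```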

Next, we prove that a large fraction of the data points in $S_k$ also belong to $\hat{S}_k$. We begin by analyzing the probability that the assigned cluster $\hat{k}_i$ of $x_i$ is the true cluster $k_i$. The similarities between $x_i$ and the estimated cluster centers can be bounded by

$$\hat{c}_{k_i}^\top x_i = \hat{c}_{k_i}^\top (c_{k_i} + g_i) = \|c_{k_i}\|_2^2 + (\hat{c}_{k_i} - c_{k_i})^\top c_{k_i} + \hat{c}_{k_i}^\top g_i \ge 1 - \Delta - |\hat{c}_{k_i}^\top g_i|,$$

$$\hat{c}_j^\top x_i = \hat{c}_j^\top (c_{k_i} + g_i) = c_j^\top c_{k_i} + (\hat{c}_j - c_j)^\top c_{k_i} + \hat{c}_j^\top g_i \le 1 - \rho + \Delta + |\hat{c}_j^\top g_i|, \quad j \ne k_i,$$

where we use $\|c_k\|_2 = 1$ and the separation condition $c_j^\top c_k \le 1 - \rho$ for $j \ne k$. Hence, $x_i$ will be assigned to cluster $k_i$ if

$$1 - \Delta - |\hat{c}_{k_i}^\top g_i| \ge 1 - \rho + \Delta + |\hat{c}_j^\top g_i|, \quad \forall j \ne k_i, \qquad (11)$$

which leads to the following sufficient condition:

$$\max_{j \in [K]} |\hat{c}_j^\top g_i| \le g_0 := \frac{\rho - 2\Delta}{2} \ge \sigma \sqrt{5 \ln\frac{3K}{\delta}},$$

where the last inequality follows from the assumption in (4). It is easy to verify that for any fixed direction $\hat{c}$ with $\|\hat{c}\|_2 = 1$, $\hat{c}^\top g_i$ is a Gaussian random variable with mean $0$ and variance $\sigma^2$. Based on the tail bound for the Gaussian distribution (Chang et al., 2011) provided in Appendix B, we have

$$\Pr\left[ \max_{j \in [K]} |\hat{c}_j^\top g_i| \ge g_0 \right] \le 2K \exp\left( -\frac{g_0^2}{2\sigma^2} \right).$$

Define

$$\delta_1 = 2K \exp\left( -\frac{g_0^2}{2\sigma^2} \right) \le \frac{2\delta}{3}. \qquad (12)$$

In summary, we have proved the following lemma.

Lemma 2. Under the condition in (4), with a probability at least $1 - \delta_1$, a point $x_i = c_{k_i} + g_i \in S_{k_i} \subseteq S$ satisfies $\max_{j \in [K]} |\hat{c}_j^\top g_i| \le g_0$, and is assigned to the correct cluster $k_i$ by the nearest neighbor search (i.e., $\hat{k}_i = k_i$).

Define

$$\tilde{S}_k = \left\{ x_i \in S_k : \max_{j \in [K]} |\hat{c}_j^\top g_i| \le g_0 \right\} \subseteq \hat{S}_k \cap S_k. \qquad (13)$$

Since each data point in $S_k$ has a probability at least $1 - \delta_1$ of falling in the set $\tilde{S}_k$, using the Chernoff bound again, we have, with a probability at least $1 - \delta$,

$$|\hat{S}_k \cap S_k| \ge |\tilde{S}_k| \ge \mathrm{E}[|\tilde{S}_k|] - \sqrt{2\, \mathrm{E}[|\tilde{S}_k|] \ln\frac{K}{\delta}} \ge (1 - \delta_1) |S_k| - \sqrt{2 |S_k| \ln\frac{K}{\delta}} \ge \frac{1}{3} |S_k| \ge \frac{2}{9} \mu_k |S|, \quad k \in [K], \qquad (14)$$

where the last two steps use (12), (10) and the lower bound on $|S|$ in (4).

2.3.2. UPPER BOUND OF $|\hat{S}_k \setminus S_k|$

Define $O_1 = \bigcup_{k=1}^K \tilde{S}_k \subseteq S$ and $O_2 = \bigcup_{k=1}^K (\hat{S}_k \setminus \tilde{S}_k) = S \setminus O_1 \subseteq S$. From Lemma 2, we know that with a probability at least $1 - \delta_1$, each $x_i \in S_k$ belongs to the set $\tilde{S}_k \subseteq O_1$. Thus, with probability at least $1 - \delta_1$, each $x_i \in S$ belongs to $O_1$; in other words, with probability at most $\delta_1$, $x_i$ belongs to $O_2$. Based on the Chernoff bound, we have, with a probability at least $1 - \delta$,

$$|O_2| \le \mathrm{E}[|O_2|] + \ln\frac{1}{\delta} + \sqrt{2\, \mathrm{E}[|O_2|] \ln\frac{1}{\delta}} \le 2 \delta_1 |S| + 2 \ln\frac{1}{\delta}. \qquad (15)$$

Since $\tilde{S}_k \subseteq S_k$, we have $\hat{S}_k \setminus S_k \subseteq \hat{S}_k \setminus \tilde{S}_k \subseteq O_2$. Therefore, with a probability at least $1 - \delta$, we have

$$|\hat{S}_k \setminus S_k| \le 2 \delta_1 |S| + 2 \ln\frac{1}{\delta}, \quad k \in [K]. \qquad (16)$$

Combining (10), (14) and (16), we have, with a probability at least $1 - 3\delta$,

$$\|A_k\|_\infty \le \frac{2 \eta_0 \left( 2 \delta_1 |S| + 2 \ln(1/\delta) \right)}{(2/9)\, \mu_k |S|} = \frac{18 \eta_0 \left( \delta_1 |S| + \ln(1/\delta) \right)}{\mu_k |S|} = O(\delta_1 \eta_0) + O\left( \frac{\eta_0 \ln(1/\delta)}{\mu_k |S|} \right), \quad k \in [K]. \qquad (17)$$
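The quantity $\delta_1$ in (12) can be estimated directly. The following sketch (random unit vectors stand in for $\hat{c}_1, \ldots, \hat{c}_K$, and all parameters are illustrative) compares the empirical frequency of $\max_{j}|\hat{c}_j^\top g_i| \ge g_0$ with the bound $2K\exp(-g_0^2/(2\sigma^2))$.

```python
# Monte-Carlo sketch of the tail bound behind (12):
# Pr[max_j |c_j^T g| >= g0] <= 2 K exp(-g0^2 / (2 sigma^2)).
import numpy as np

rng = np.random.default_rng(2)
d, K, sigma, delta, trials = 200, 5, 1.0, 0.05, 50_000
C = rng.standard_normal((K, d))
C /= np.linalg.norm(C, axis=1, keepdims=True)          # unit-norm directions
G = sigma * rng.standard_normal((trials, d))           # g ~ N(0, sigma^2 I)
g0 = sigma * np.sqrt(2 * np.log(3 * K / delta))        # illustrative g0
emp = np.mean(np.max(np.abs(G @ C.T), axis=1) >= g0)
bound = 2 * K * np.exp(-g0**2 / (2 * sigma**2))
print(f"empirical {emp:.5f} <= bound {bound:.5f}")
```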

2.4. Bound for $B_k$

Notice that $\{g_i : x_i \in \hat{S}_k\}$, being determined by the estimated centers $\hat{c}_1, \ldots, \hat{c}_K$, is a specific subset of $\{g_i : x_i \in S\}$. Although each $g_i$ is drawn from the Gaussian distribution $\mathcal{N}(0, \sigma^2 I)$, the distribution of the elements of $\{g_i : x_i \in \hat{S}_k\}$ is unknown. As a result, we cannot directly apply concentration inequalities for Gaussian random vectors to bound $B_k$.

Let $U \in \mathbb{R}^{d \times K}$ be a matrix whose columns are basis vectors of the subspace spanned by $\hat{c}_1, \ldots, \hat{c}_K$, and let $U_\perp \in \mathbb{R}^{d \times (d - K)}$ be a matrix whose columns are basis vectors of the complementary subspace. We then divide each $g_i$ as

$$g_i = g_i^{\parallel} + g_i^{\perp}, \ \text{ where } g_i^{\parallel} = U U^\top g_i \ \text{ and } g_i^{\perp} = U_\perp U_\perp^\top g_i.$$

First, we upper bound $\|B_k\|_\infty$ as

$$\|B_k\|_\infty \le \underbrace{\frac{2}{|\hat{S}_k|} \Big\| \sum_{x_i \in \hat{S}_k} g_i^{\perp} \Big\|_\infty}_{B_k^1} + \underbrace{\frac{2}{|\hat{S}_k|} \Big\| \sum_{x_i \in \hat{S}_k \setminus \tilde{S}_k} g_i^{\parallel} \Big\|_2}_{B_k^2} + \underbrace{\frac{2}{|\hat{S}_k|} \Big\| \sum_{x_i \in \hat{S}_k \cap \tilde{S}_k} g_i^{\parallel} \Big\|_2}_{B_k^3}. \qquad (18)$$

In the following, we discuss how to bound each term on the right-hand side of (18).

2.4.1. UPPER BOUND OF $B_k^1$

Since the cluster assignment depends on $g_i$ only through the projections $\hat{c}_j^\top g_i$, the orthogonal components are independent of the assignment. Following the property of Gaussian random vectors, $U_\perp^\top \big( \sum_{x_i \in \hat{S}_k} g_i \big) / \big( \sigma \sqrt{|\hat{S}_k|} \big)$ can therefore be treated as a $(d - K)$-dimensional Gaussian random vector, and each element of $U_\perp U_\perp^\top \big( \sum_{x_i \in \hat{S}_k} g_i \big) / \big( \sigma \sqrt{|\hat{S}_k|} \big)$ is a Gaussian random variable with variance smaller than $1$. Based on the tail bound for the Gaussian distribution (Chang et al., 2011) provided in Appendix B and the union bound, with a probability at least $1 - \delta$, we have

$$\Big\| \sum_{x_i \in \hat{S}_k} g_i^{\perp} \Big\|_\infty \le \sigma \sqrt{|\hat{S}_k|} \sqrt{2 \ln\frac{2Kd}{\delta}}, \quad k \in [K],$$

which, together with (10) and (14), implies

$$B_k^1 \le \frac{2 \sigma \sqrt{2 \ln(2Kd/\delta)}}{\sqrt{|\hat{S}_k|}} \le \frac{2 \sigma \sqrt{2 \ln(2Kd/\delta)}}{\sqrt{(2/9)\, \mu_k |S|}} = O\left( \sigma \sqrt{\frac{\ln d}{\mu_k |S|}} \right), \quad k \in [K]. \qquad (19)$$

2.4.2. UPPER BOUND OF $B_k^2$

First, we have

$$\Big\| \sum_{x_i \in \hat{S}_k \setminus \tilde{S}_k} g_i^{\parallel} \Big\|_2 = \Big\| \sum_{x_i \in \hat{S}_k \setminus \tilde{S}_k} U U^\top g_i \Big\|_2 \le \sum_{x_i \in \hat{S}_k \setminus \tilde{S}_k} \| U^\top g_i \|_2. \qquad (20)$$

Since $U^\top g_i / \sigma$ can be treated as a $K$-dimensional standard Gaussian random vector, based on the tail bound for the $\chi^2$ distribution (Laurent & Massart, 2000), we have, with a probability at least $1 - \delta$,

$$\| U^\top g_i \|_2 \le \sigma \left( \sqrt{K} + \sqrt{2 \log\frac{1}{\delta}} \right).$$

Applying the union bound again, with a probability at least $1 - \delta$, we have

$$\max_i \| U^\top g_i \|_2 \le \sigma \left( \sqrt{K} + \sqrt{2 \log\frac{|S|}{\delta}} \right). \qquad (21)$$

Combining (20) and (21) with (14) and (16), we have

$$B_k^2 \le \frac{2 \left( 2 \delta_1 |S| + 2 \ln(1/\delta) \right)}{(2/9)\, \mu_k |S|}\, \sigma \left( \sqrt{K} + \sqrt{2 \log\frac{|S|}{\delta}} \right) = O\left( \delta_1 \sigma \sqrt{K + \log\frac{|S|}{\delta}} \right) + O\left( \frac{\sigma \ln(1/\delta)}{\mu_k |S|} \sqrt{K + \log\frac{|S|}{\delta}} \right), \quad k \in [K]. \qquad (22)$$
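The split $g_i = g_i^{\parallel} + g_i^{\perp}$ that drives all three bounds in this section can be reproduced numerically. The sketch below (with synthetic stand-ins for the estimated centers) builds $U$ and $U_\perp$ from a QR factorization; only $g_i^{\parallel}$ can depend on the cluster assignment, which is why $B_k^1$ admits a direct Gaussian tail bound.

```python
# Sketch of the subspace decomposition used for B_k: U spans the estimated
# centers, U_perp spans the complement, and g = g_par + g_perp exactly.
import numpy as np

rng = np.random.default_rng(3)
d, K = 50, 4
C_hat = rng.standard_normal((d, K))                    # stand-in estimated centers
Q, _ = np.linalg.qr(C_hat, mode="complete")            # full orthonormal basis of R^d
U, U_perp = Q[:, :K], Q[:, K:]                         # span(c_hat) / complement

g = rng.standard_normal(d)
g_par = U @ (U.T @ g)                                  # component inside span(c_hat)
g_perp = U_perp @ (U_perp.T @ g)                       # orthogonal component
assert np.allclose(g, g_par + g_perp)
print(f"||g_par|| = {np.linalg.norm(g_par):.3f}, ||g_perp|| = {np.linalg.norm(g_perp):.3f}")
```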

2.4.3. UPPER BOUND OF $B_k^3$

First, we have

$$\frac{1}{|\tilde{S}_k|} \Big\| \sum_{x_i \in \tilde{S}_k} g_i^{\parallel} \Big\|_2 = \frac{1}{|\tilde{S}_k|} \Big\| U \sum_{x_i \in \tilde{S}_k} U^\top g_i \Big\|_2 \le \frac{1}{|\tilde{S}_k|} \Big\| \sum_{x_i \in \tilde{S}_k} U^\top g_i \Big\|_2 := u_k. \qquad (23)$$

Recall the definition of $\tilde{S}_k$ in (13). Due to the fact that the constraint set $\{g : \max_{j \in [K]} |\hat{c}_j^\top g| \le g_0\}$ is symmetric, we have $\mathrm{E}[U^\top g_i] = 0$ for $x_i \in \tilde{S}_k$. Under the condition in (21), we can invoke the following lemma to bound $u_k$.

Lemma 3 (Lemma 2 from (Smale & Zhou, 2007)). Let $H$ be a Hilbert space and $\xi$ be a random variable on $(Z, \rho)$ with values in $H$. Assume $\|\xi\| \le M < \infty$ almost surely. Denote $\sigma^2(\xi) = \mathrm{E}(\|\xi\|^2)$. Let $\{z_i\}_{i=1}^m$ be independent random drawers of $\rho$. For any $0 < \delta < 1$, with confidence $1 - \delta$,

$$\Big\| \frac{1}{m} \sum_{i=1}^m \left( \xi_i - \mathrm{E}[\xi_i] \right) \Big\| \le \frac{2 M \ln(2/\delta)}{m} + \sqrt{\frac{2 \sigma^2(\xi) \ln(2/\delta)}{m}}.$$

From Lemma 3 and the union bound, with a probability at least $1 - \delta$, we have

$$u_k \le \frac{2 \sigma \left( \sqrt{K} + \sqrt{2 \log(|S|/\delta)} \right) \ln(2K/\delta)}{|\tilde{S}_k|} + \sigma \sqrt{\frac{2 K \ln(2K/\delta)}{|\tilde{S}_k|}}, \quad k \in [K]. \qquad (24)$$

Combining (23) and (24) with (10) and (14) (and using $\tilde{S}_k \subseteq \hat{S}_k$, so that $B_k^3 \le 2 u_k$), we have

$$B_k^3 \le 2 u_k \le \frac{18 \sigma \left( \sqrt{K} + \sqrt{2 \log(|S|/\delta)} \right) \ln(2K/\delta)}{\mu_k |S|} + 6 \sigma \sqrt{\frac{K \ln(2K/\delta)}{\mu_k |S|}} = O\left( \sigma \sqrt{\frac{K \ln(K/\delta)}{\mu_k |S|}} \right), \quad k \in [K]. \qquad (25)$$

In summary, under the condition that (10), (14) and (15) hold, we have, with a probability at least $1 - 3\delta$,

$$\|B_k\|_\infty \le O\left( \delta_1 \sigma \sqrt{K + \log\frac{|S|}{\delta}} \right) + O\left( \sigma \sqrt{\frac{(K + \ln d) \ln(K/\delta)}{\mu_k |S|}} \right), \quad k \in [K]. \qquad (26)$$
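Lemma 3 holds for any bounded Hilbert-space-valued variable, so it can be sanity-checked on a toy distribution. The sketch below (coordinates uniform on $[-1, 1]$, a purely illustrative choice) verifies that the norm of the centered empirical mean stays below $2M\ln(2/\delta)/m + \sqrt{2\sigma^2(\xi)\ln(2/\delta)/m}$ at least a $1 - \delta$ fraction of the time.

```python
# Monte-Carlo check of the vector concentration bound in Lemma 3 for a toy
# bounded distribution: xi uniform on [-1, 1]^dim, so ||xi|| <= sqrt(dim) =: M
# and E||xi||^2 = dim / 3 =: sigma2.
import numpy as np

rng = np.random.default_rng(4)
dim, m, delta, trials = 20, 500, 0.05, 2000
M, sigma2 = np.sqrt(dim), dim / 3.0
bound = 2 * M * np.log(2 / delta) / m + np.sqrt(2 * sigma2 * np.log(2 / delta) / m)

norms = np.array([np.linalg.norm(rng.uniform(-1, 1, (m, dim)).mean(axis=0))
                  for _ in range(trials)])
print(f"fraction above bound {np.mean(norms > bound):.4f} (should be <= {delta})")
```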

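The appendices below collect the concentration tools used throughout the proofs. As a quick numerical sanity check of the derived upper-tail form of the Chernoff bound stated in Appendix A, the following sketch (with illustrative Bernoulli parameters) estimates $\Pr[S \ge \mu + \ln(1/\delta) + \sqrt{2\mu\ln(1/\delta)}]$ and compares it with $\delta$.

```python
# Monte-Carlo check of the upper-tail corollary of Theorem 2 (Appendix A):
# Pr[S >= mu + ln(1/delta) + sqrt(2 mu ln(1/delta))] <= delta.
import numpy as np

rng = np.random.default_rng(5)
n, p, delta, trials = 1000, 0.1, 0.05, 100_000
mu = n * p
threshold = mu + np.log(1 / delta) + np.sqrt(2 * mu * np.log(1 / delta))
S = rng.binomial(n, p, size=trials)                    # sums of independent Bernoullis
print(f"empirical tail {np.mean(S >= threshold):.5f} <= delta = {delta}")
```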
A. Chernoff Bound

Theorem 2 (Multiplicative Chernoff Bound (Angluin & Valiant, 1979)). Let $X_1, X_2, \ldots, X_n$ be independent binary random variables with $\Pr[X_i = 1] = p_i$. Denote $S = \sum_{i=1}^n X_i$ and $\mu = \mathrm{E}[S] = \sum_{i=1}^n p_i$. We have

$$\Pr[S \le (1 - \varepsilon)\mu] \le \exp\left( -\frac{\varepsilon^2 \mu}{2} \right), \quad \text{for } 0 < \varepsilon < 1,$$

$$\Pr[S \ge (1 + \varepsilon)\mu] \le \exp\left( -\frac{\varepsilon^2 \mu}{2 + \varepsilon} \right), \quad \text{for } \varepsilon > 0.$$

Therefore,

$$\Pr\left[ S \le \mu - \sqrt{2 \mu \ln\frac{1}{\delta}} \right] \le \delta, \quad \text{for } \exp(-\mu/2) < \delta < 1,$$

$$\Pr\left[ S \ge \mu + \ln\frac{1}{\delta} + \sqrt{2 \mu \ln\frac{1}{\delta}} \right] \le \delta, \quad \text{for } 0 < \delta < 1.$$

B. Tail bounds for the Gaussian distribution

Theorem 3 (Chernoff-type upper bound for the Q-function (Chang et al., 2011)). The Q-function, defined as

$$Q(x) = \frac{1}{\sqrt{2\pi}} \int_x^{\infty} \exp\left( -\frac{t^2}{2} \right) dt,$$

is the tail probability of the standard Gaussian distribution. When $x > 0$, we have

$$Q(x) \le \frac{1}{2} \exp\left( -\frac{x^2}{2} \right).$$

Let $X \sim \mathcal{N}(0, 1)$ be a Gaussian random variable. According to Theorem 3, we have

$$\Pr[|X| \ge \varepsilon] \le \exp\left( -\frac{\varepsilon^2}{2} \right), \quad \text{or equivalently} \quad \Pr\left[ |X| \ge \sqrt{2 \ln\frac{1}{\delta}} \right] \le \delta.$$

References

Angluin, D. and Valiant, L. G. Fast probabilistic algorithms for Hamiltonian circuits and matchings. Journal of Computer and System Sciences, 18(2):155-193, 1979.

Chang, Seok-Ho, Cosman, Pamela C., and Milstein, Laurence B. Chernoff-type bounds for the Gaussian error function. IEEE Transactions on Communications, 59(11):2939-2944, 2011.

Laurent, B. and Massart, P. Adaptive estimation of a quadratic functional by model selection. The Annals of Statistics, 28(5):1302-1338, 2000.

Smale, Steve and Zhou, Ding-Xuan. Learning theory estimates via integral operators and their approximations. Constructive Approximation, 26(2):153-172, 2007.
