Technical Proofs for Nonlinear Learning using Local Coordinate Coding

1 Notations and Main Results

Definition 1.1 (Lipschitz Smoothness) A function $f(x)$ on $\mathbb{R}^d$ is $(\alpha, \beta, p)$-Lipschitz smooth with respect to a norm $\|\cdot\|$ if
$$|f(x') - f(x)| \le \alpha \|x - x'\|$$
and
$$|f(x') - f(x) - \nabla f(x)^\top (x' - x)| \le \beta \|x - x'\|^{1+p},$$
where we assume $\alpha, \beta > 0$ and $p \in (0, 1]$.

Definition 1.2 (Coordinate Coding) A coordinate coding is a pair $(\gamma, C)$, where $C \subset \mathbb{R}^d$ is a set of anchor points, and $\gamma$ is a map of $x \in \mathbb{R}^d$ to $[\gamma_v(x)]_{v \in C} \in \mathbb{R}^{|C|}$ such that $\sum_{v \in C} \gamma_v(x) = 1$. It induces the following physical approximation of $x$ in $\mathbb{R}^d$:
$$\gamma(x) = \sum_{v \in C} \gamma_v(x) \, v.$$
Moreover, for all $x \in \mathbb{R}^d$, we define the coding norm as
$$\|x\|_\gamma = \Big( \sum_{v \in C} \gamma_v(x)^2 \Big)^{1/2}.$$

Proposition 1.1 The map $x \mapsto \sum_{v \in C} \gamma_v(x) v$ is invariant under any shift of the origin for representing data points in $\mathbb{R}^d$ if and only if $\sum_{v \in C} \gamma_v(x) = 1$.

Lemma 1.1 (Linearization) Let $(\gamma, C)$ be an arbitrary coordinate coding on $\mathbb{R}^d$, and let $f$ be an $(\alpha, \beta, p)$-Lipschitz smooth function. We have for all $x \in \mathbb{R}^d$:
$$\Big| f(x) - \sum_{v \in C} \gamma_v(x) f(v) \Big| \le \alpha \|x - \gamma(x)\| + \beta \sum_{v \in C} |\gamma_v(x)| \, \|v - \gamma(x)\|^{1+p}.$$

Definition 1.3 (Localization Measure) Given $\alpha, \beta, p$, and coding $(\gamma, C)$, we define
$$Q_{\alpha,\beta,p}(\gamma, C) = \mathbf{E}_x \Big[ \alpha \|x - \gamma(x)\| + \beta \sum_{v \in C} |\gamma_v(x)| \, \|v - \gamma(x)\|^{1+p} \Big].$$
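To make Definitions 1.2 and 1.3 concrete, here is a minimal numerical sketch that builds a toy coding on $\mathbb{R}^2$ and estimates $Q_{\alpha,\beta,p}$ by Monte Carlo. The coding rule used below, affine least-squares weights over the $k$ nearest anchors, is only an illustrative assumption (as are all function names): the definitions constrain $\gamma$ only through $\sum_{v} \gamma_v(x) = 1$.

```python
import numpy as np

def toy_coding(x, anchors, k=3):
    """Illustrative coding: affine least-squares weights on the k nearest
    anchors (zero elsewhere), so sum_v gamma_v(x) = 1 by construction."""
    d2 = np.sum((anchors - x) ** 2, axis=1)
    idx = np.argsort(d2)[:k]
    V = anchors[idx]                       # (k, d) nearest anchors
    k_ = V.shape[0]
    # Solve min_g ||V^T g - x||^2 subject to sum(g) = 1 via the KKT system.
    KKT = np.block([[V @ V.T, np.ones((k_, 1))],
                    [np.ones((1, k_)), np.zeros((1, 1))]])
    b = np.concatenate([V @ x, [1.0]])
    sol = np.linalg.solve(KKT, b)
    gamma = np.zeros(len(anchors))
    gamma[idx] = sol[:k_]
    return gamma

def localization_measure(xs, anchors, alpha, beta, p):
    """Monte-Carlo estimate of Q_{alpha,beta,p}(gamma, C), Definition 1.3."""
    vals = []
    for x in xs:
        g = toy_coding(x, anchors)
        gx = g @ anchors                   # physical approximation gamma(x)
        approx_err = np.linalg.norm(x - gx)
        locality = sum(abs(g[i]) * np.linalg.norm(anchors[i] - gx) ** (1 + p)
                       for i in range(len(anchors)) if g[i] != 0.0)
        vals.append(alpha * approx_err + beta * locality)
    return float(np.mean(vals))

rng = np.random.default_rng(0)
anchors = rng.uniform(-1, 1, size=(20, 2))
xs = rng.uniform(-1, 1, size=(200, 2))
print("Q_{1,1,1} estimate:", localization_measure(xs, anchors, 1.0, 1.0, 1.0))
print("coding norm of xs[0]:", np.linalg.norm(toy_coding(xs[0], anchors)))
```

Note how the two terms of $Q$ trade off: denser anchors shrink both the approximation error $\|x - \gamma(x)\|$ and the locality penalty, which is the behavior Theorem 1.1 quantifies.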
Definition 1.4 (Manifold) A subset $M \subset \mathbb{R}^d$ is called a $p$-smooth ($p > 0$) manifold with intrinsic dimensionality $m = m(M)$ if there exists a constant $c_p(M)$ such that given any $x \in M$, there exist $m$ vectors $v_1(x), \ldots, v_m(x) \in \mathbb{R}^d$ so that for all $x' \in M$:
$$\inf_{\gamma \in \mathbb{R}^m} \Big\| x' - x - \sum_{j=1}^m \gamma_j v_j(x) \Big\| \le c_p(M) \, \|x' - x\|^{1+p}.$$

Definition 1.5 (Covering Number) Given any subset $M \subset \mathbb{R}^d$ and $\epsilon > 0$, the covering number, denoted $\mathcal{N}(\epsilon, M)$, is the smallest cardinality of an $\epsilon$-cover $C \subset M$; that is, $\sup_{x \in M} \inf_{v \in C} \|x - v\| \le \epsilon$.

Theorem 1.1 (Manifold Coding) Suppose the data points $x$ lie on a compact $p$-smooth manifold $M$, and the norm is defined as $\|x\| = (x^\top A x)^{1/2}$ for some positive definite matrix $A$. Then given any $\epsilon > 0$, there exist anchor points $C \subset M$ and coding $\gamma$ such that
$$|C| \le (1 + m(M)) \, \mathcal{N}(\epsilon, M), \qquad Q_{\alpha,\beta,p}(\gamma, C) \le \big[ \alpha c_p(M) + (1 + \sqrt{m} + 2^{1+p} \sqrt{m}) \beta \big] \epsilon^{1+p}.$$
Moreover, for all $x \in M$, we have $\|x\|_\gamma^2 \le 1 + (1 + \sqrt{m})^2$.

Given a local-coordinate coding scheme $(\gamma, C)$, we approximate each $f(x) \in \mathcal{F}_{\alpha,\beta,p}$ by $f(x) \approx f_{\gamma,C}(\hat w, x) = \sum_{v \in C} \hat w_v \gamma_v(x)$, where we estimate the coefficients using ridge regression as:
$$[\hat w_v] = \arg\min_{[w_v]} \Big[ \frac{1}{n} \sum_{i=1}^n \phi(f_{\gamma,C}(w, x_i), y_i) + \lambda \sum_{v \in C} (w_v - g(v))^2 \Big]. \tag{1}$$

Theorem 1.2 (Generalization Bound) Suppose $\phi(p, y)$ is Lipschitz: $|\phi'_1(p, y)| \le B$. Consider coordinate coding $(\gamma, C)$ and the estimation method (1) with random training examples $S_n = \{(x_1, y_1), \ldots, (x_n, y_n)\}$. Then the expected generalization error satisfies the inequality:
$$\mathbf{E}_{S_n} \mathbf{E}_{x,y} \, \phi(f_{\gamma,C}(\hat w, x), y) \le \inf_{f \in \mathcal{F}_{\alpha,\beta,p}} \Big[ \mathbf{E}_{x,y} \, \phi(f(x), y) + \lambda \sum_{v \in C} (f(v) - g(v))^2 \Big] + \frac{B^2}{2\lambda n} \mathbf{E}_x \|x\|_\gamma^2 + B \, Q_{\alpha,\beta,p}(\gamma, C).$$

Theorem 1.3 (Consistency) Suppose the data lie on a compact manifold $M \subset \mathbb{R}^d$, the norm is the Euclidean norm in $\mathbb{R}^d$, and the loss function $\phi(p, y)$ is Lipschitz. As $n \to \infty$, choose $\alpha, \beta \to \infty$ with $\alpha/n, \beta/n \to 0$ ($\alpha$ and $\beta$ depend on $n$), and $p = 0$. Then it is possible to find a coding $(\gamma, C)$ using unlabeled data such that $|C|/n \to 0$ and $Q_{\alpha,\beta,p}(\gamma, C) \to 0$. If we pick $\lambda n \to \infty$ and $\lambda |C| \to 0$, then the local coordinate coding method (1) with $g(v) \equiv 0$ is consistent as $n \to \infty$:
$$\lim_{n \to \infty} \mathbf{E}_{S_n} \mathbf{E}_{x,y} \, \phi(f_{\gamma,C}(\hat w, x), y) = \inf_{f : M \to \mathbb{R}} \mathbf{E}_{x,y} \, \phi(f(x), y).$$
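Before turning to the proofs, note that for the squared loss $\phi(p, y) = (p - y)^2$ the estimator (1) is an ordinary finite-dimensional ridge regression in the coding coefficients and has a closed form. The sketch below is an illustration under that assumed loss (the theorems themselves hold for any Lipschitz convex loss), with hypothetical helper names, given precomputed codings $\gamma_v(x_i)$. Setting the gradient of (1) to zero gives $(\Gamma^\top \Gamma / n + \lambda I) w = \Gamma^\top y / n + \lambda g$.

```python
import numpy as np

def fit_lcc_ridge(Gamma, y, lam, g=None):
    """Solve (1) for the squared loss phi(p, y) = (p - y)^2:
        min_w (1/n) ||Gamma w - y||^2 + lam ||w - g||^2
    Gamma : (n, |C|) matrix with Gamma[i, v] = gamma_v(x_i)
    g     : (|C|,) prior coefficients g(v), defaulting to zero."""
    n, m = Gamma.shape
    if g is None:
        g = np.zeros(m)
    A = Gamma.T @ Gamma / n + lam * np.eye(m)
    b = Gamma.T @ y / n + lam * g
    return np.linalg.solve(A, b)

def predict(Gamma, w):
    """f_{gamma,C}(w, x) = sum_v w_v gamma_v(x), for each row of codings."""
    return Gamma @ w

# Tiny usage example with synthetic codings (illustrative only).
rng = np.random.default_rng(1)
Gamma = rng.normal(size=(50, 10))
w_true = rng.normal(size=10)
y = Gamma @ w_true + 0.1 * rng.normal(size=50)
w_hat = fit_lcc_ridge(Gamma, y, lam=1e-2)
print("train RMSE:", np.sqrt(np.mean((predict(Gamma, w_hat) - y) ** 2)))
```

The $1/n$ normalization of the empirical loss in (1) matters: it is what produces the $B^2/(2\lambda n)$ stability term in Theorem 1.2.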
2 Proofs

2.1 Proof of Proposition 1.1

Consider a change of the $\mathbb{R}^d$ origin by $u \in \mathbb{R}^d$, which shifts any point $x \in \mathbb{R}^d$ to $x + u$, and each anchor $v \in C$ to $v + u$. The shift-invariance requirement implies that after the change, we map $x + u$ to $\sum_{v} \gamma_v(x) v + u$, which should equal $\sum_{v} \gamma_v(x)(v + u)$. This is equivalent to $u = \sum_{v} \gamma_v(x) u$, which holds if and only if $\sum_{v} \gamma_v(x) = 1$.

2.2 Proof of Lemma 1.1

For simplicity, let $\gamma_v = \gamma_v(x)$ and $x' = \gamma(x) = \sum_v \gamma_v v$. Since $\sum_v \gamma_v = 1$ and $\sum_v \gamma_v (v - x') = \gamma(x) - x' = 0$, we have
$$\begin{aligned}
\Big| f(x) - \sum_v \gamma_v f(v) \Big|
&= \Big| f(x) - f(x') - \sum_v \gamma_v \big( f(v) - f(x') \big) \Big| \\
&= \Big| f(x) - f(x') - \sum_v \gamma_v \big( f(v) - f(x') - \nabla f(x')^\top (v - x') \big) \Big| \\
&\le |f(x) - f(x')| + \sum_v |\gamma_v| \, \big| f(v) - f(x') - \nabla f(x')^\top (v - x') \big| \\
&\le \alpha \|x - x'\| + \beta \sum_v |\gamma_v| \, \|x' - v\|^{1+p}.
\end{aligned}$$
This implies the bound.
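The inequality of Lemma 1.1 can be sanity-checked numerically. The sketch below is an illustration with arbitrary choices, not taken from the text: it uses $f(x) = \|x\|^2$, which with respect to the Euclidean norm is $(2R, 1, 1)$-Lipschitz smooth on a ball of radius $R$ (the second-order remainder is exactly $\|x' - x\|^2$), and random convex weights, a special case of a coordinate coding since they sum to one. With $R = 2$, all points involved, including $\gamma(x)$, stay in the ball because the weights are convex.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n_anchors, R = 3, 8, 2.0
anchors = rng.uniform(-1, 1, size=(n_anchors, d))

def f(x):
    # f(x) = ||x||^2 is (2R, 1, 1)-smooth on the ball of radius R.
    return float(x @ x)

for _ in range(5):
    x = rng.uniform(-1, 1, size=d)
    gamma = rng.uniform(size=n_anchors)   # convex weights: nonnegative,
    gamma /= gamma.sum()                  # normalized to sum to one
    gx = gamma @ anchors                  # physical approximation gamma(x)
    lhs = abs(f(x) - sum(gamma[i] * f(anchors[i]) for i in range(n_anchors)))
    rhs = (2 * R) * np.linalg.norm(x - gx) + sum(
        abs(gamma[i]) * np.linalg.norm(anchors[i] - gx) ** 2
        for i in range(n_anchors))
    assert lhs <= rhs + 1e-9              # Lemma 1.1 with alpha=2R, beta=1, p=1
    print(f"lhs = {lhs:.4f} <= rhs = {rhs:.4f}")
```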
2.3 Proof of Theorem 1.1

Let $m = m(M)$. Given any $\epsilon > 0$, consider an $\epsilon$-cover $C'$ of $M$ with $|C'| \le \mathcal{N}(\epsilon, M)$. For each $u \in C'$, define $C_u = \{v_1(u), \ldots, v_m(u)\}$, where the $v_j(u)$ are as in Definition 1.4. Define the anchor points as
$$C = \bigcup_{u \in C'} \{ u + v_j(u) : j = 1, \ldots, m \} \cup C'.$$
It follows that $|C| \le (1 + m) \mathcal{N}(\epsilon, M)$. In the following, we only need to prove the existence of a coding $\gamma$ on $M$ that satisfies the requirement of the theorem. Without loss of generality, we assume that $\|v_j(u)\| = \epsilon$ for each $u$ and $j$, and that given $u$, the $\{v_j(u) : j = 1, \ldots, m\}$ are orthogonal with respect to $A$: $v_j(u)^\top A v_k(u) = 0$ when $j \ne k$.

For each $x \in M$, let $u_x \in C'$ be the closest point to $x$ in $C'$; we have $\|x - u_x\| \le \epsilon$ by the definition of $C'$. Now, Definition 1.4 implies that there exist $\gamma_j(x)$ ($j = 1, \ldots, m$) such that
$$\Big\| x - u_x - \sum_{j=1}^m \gamma_j(x) v_j(u_x) \Big\| \le c_p(M) \|x - u_x\|^{1+p} \le c_p(M) \epsilon^{1+p}.$$
The optimal choice is the $A$-projection of $x - u_x$ onto the subspace spanned by $\{v_j(u_x) : j = 1, \ldots, m\}$. The orthogonality condition thus implies that
$$\sum_{j=1}^m \gamma_j(x)^2 \|v_j(u_x)\|^2 \le \|x - u_x\|^2 \le \epsilon^2.$$
Therefore $\sum_{j=1}^m \gamma_j(x)^2 \le 1$, which implies that for all $x$:
$$\sum_{j=1}^m |\gamma_j(x)| \le \sqrt{m}.$$
We can now define the coordinate coding of $x \in M$ as
$$\gamma_v(x) = \begin{cases} \gamma_j(x) & v = u_x + v_j(u_x),\ j = 1, \ldots, m, \\ 1 - \sum_{j=1}^m \gamma_j(x) & v = u_x, \\ 0 & \text{otherwise.} \end{cases}$$
This implies the following bounds:
$$\|x - \gamma(x)\| \le c_p(M) \, \epsilon^{1+p} \tag{2}$$
and
$$\sum_{v \in C} |\gamma_v(x)| \, \|v - \gamma(x)\|^{1+p} = |\gamma_{u_x}(x)| \, \|\gamma(x) - u_x\|^{1+p} + \sum_{j=1}^m |\gamma_j(x)| \, \|u_x + v_j(u_x) - \gamma(x)\|^{1+p}$$
$$\le (1 + \sqrt{m}) \, \epsilon^{1+p} + \sum_{j=1}^m |\gamma_j(x)| \, (\epsilon + \epsilon)^{1+p} \tag{3}$$
$$\le \big[ 1 + \sqrt{m} + 2^{1+p} \sqrt{m} \big] \epsilon^{1+p}, \tag{4}$$
where we have used $\|v_j(u_x)\| = \epsilon$, $|\gamma_{u_x}(x)| = |1 - \sum_j \gamma_j(x)| \le 1 + \sqrt{m}$, and $\|\gamma(x) - u_x\| \le \|x - u_x\| \le \epsilon$ (the $A$-projection is a contraction). Together with Definition 1.3, the bounds (2) and (4) yield the stated bound on $Q_{\alpha,\beta,p}(\gamma, C)$. Finally, $\|x\|_\gamma^2 = \gamma_{u_x}(x)^2 + \sum_{j=1}^m \gamma_j(x)^2 \le (1 + \sqrt{m})^2 + 1$.
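The proof above is constructive and can be mirrored in code. The sketch below is an illustration only: it uses a greedy $\epsilon$-cover, the Euclidean norm ($A = I$), and local PCA directions of neighboring points as a stand-in for the tangent vectors $v_j(u)$ of Definition 1.4, then forms $\gamma_v(x)$ exactly as in the proof: project $x - u_x$ onto the directions at the nearest cover point $u_x$, put weight $\gamma_j(x)$ on $u_x + v_j(u_x)$, and weight $1 - \sum_j \gamma_j(x)$ on $u_x$.

```python
import numpy as np

def greedy_cover(X, eps):
    """Greedy epsilon-cover: every point of X lies within eps of a center."""
    centers = []
    for x in X:
        if not centers or min(np.linalg.norm(x - c) for c in centers) > eps:
            centers.append(x)
    return np.array(centers)

def local_directions(X, u, eps, m):
    """PCA of points near u, scaled to length eps: a stand-in for the
    A-orthogonal tangent vectors v_j(u) assumed in the proof."""
    nbr = X[np.linalg.norm(X - u, axis=1) <= 2 * eps]
    _, _, Vt = np.linalg.svd(nbr - nbr.mean(axis=0), full_matrices=False)
    return eps * Vt[:m]                    # (m, d) rows, orthogonal, norm eps

def manifold_coding(x, centers, dirs):
    """The coding from the proof of Theorem 1.1, with A = I."""
    i = int(np.argmin(np.linalg.norm(centers - x, axis=1)))
    u, V = centers[i], dirs[i]             # u_x and the v_j(u_x)
    g = V @ (x - u) / np.sum(V * V, axis=1)  # projection coefficients gamma_j(x)
    anchors = np.vstack([u, u + V])        # u_x, then u_x + v_j(u_x)
    weights = np.concatenate([[1.0 - g.sum()], g])  # weights sum to one
    return anchors, weights

# Toy data on a 1-dimensional manifold (a circle) embedded in R^2.
rng = np.random.default_rng(3)
t = rng.uniform(0, 2 * np.pi, size=500)
X = np.c_[np.cos(t), np.sin(t)]
centers = greedy_cover(X, eps=0.3)
dirs = [local_directions(X, u, eps=0.3, m=1) for u in centers]
anchors, w = manifold_coding(X[0], centers, dirs)
print("reconstruction error ||gamma(x) - x||:", np.linalg.norm(w @ anchors - X[0]))
```

As the theorem predicts, shrinking $\epsilon$ drives the reconstruction error toward zero at the cost of a larger anchor set $|C| \le (1 + m)\mathcal{N}(\epsilon, M)$.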
2.4 Proof of Theorem 1.2

Consider $n + 1$ samples $S = \{(x_1, y_1), \ldots, (x_{n+1}, y_{n+1})\}$. To simplify notation we take $g(v) \equiv 0$; the general case is identical with $w_v$ replaced by $w_v - g(v)$ throughout. We shall introduce the following notation:
$$[\bar w_v] = \arg\min_{[w_v]} \Big[ \frac{1}{n} \sum_{i=1}^{n+1} \phi(f_{\gamma,C}(w, x_i), y_i) + \lambda \sum_{v \in C} w_v^2 \Big]. \tag{5}$$
Let $k$ be an integer randomly drawn from $\{1, \ldots, n+1\}$, and let $[\hat w_v^{(k)}]$ be the solution of
$$[\hat w_v^{(k)}] = \arg\min_{[w_v]} \Big[ \frac{1}{n} \sum_{i=1,\ldots,n+1;\, i \ne k} \phi(f_{\gamma,C}(w, x_i), y_i) + \lambda \sum_{v \in C} w_v^2 \Big],$$
with the $k$-th example left out. We have the following stability lemma from [1], which can be stated as follows using our terminology:

Lemma 2.1 The following inequality holds:
$$\big| f_{\gamma,C}(\hat w^{(k)}, x_k) - f_{\gamma,C}(\bar w, x_k) \big| \le \frac{\|x_k\|_\gamma^2}{2\lambda n} \, \big| \phi'_1(f_{\gamma,C}(\bar w, x_k), y_k) \big|.$$

By using Lemma 2.1, we obtain for each $k$:
$$\begin{aligned}
&\phi(f_{\gamma,C}(\bar w, x_k), y_k) - \phi(f_{\gamma,C}(\hat w^{(k)}, x_k), y_k) \\
&= \big[ \phi(f_{\gamma,C}(\bar w, x_k), y_k) - \phi(f_{\gamma,C}(\hat w^{(k)}, x_k), y_k) - \phi'_1(f_{\gamma,C}(\hat w^{(k)}, x_k), y_k)\big(f_{\gamma,C}(\bar w, x_k) - f_{\gamma,C}(\hat w^{(k)}, x_k)\big) \big] \\
&\quad + \phi'_1(f_{\gamma,C}(\hat w^{(k)}, x_k), y_k)\big(f_{\gamma,C}(\bar w, x_k) - f_{\gamma,C}(\hat w^{(k)}, x_k)\big) \\
&\ge \phi'_1(f_{\gamma,C}(\hat w^{(k)}, x_k), y_k)\big(f_{\gamma,C}(\bar w, x_k) - f_{\gamma,C}(\hat w^{(k)}, x_k)\big) \\
&\ge - \big|\phi'_1(f_{\gamma,C}(\hat w^{(k)}, x_k), y_k)\big| \, \big|\phi'_1(f_{\gamma,C}(\bar w, x_k), y_k)\big| \, \|x_k\|_\gamma^2 / (2\lambda n) \\
&\ge - B^2 \|x_k\|_\gamma^2 / (2\lambda n).
\end{aligned}$$
In the above derivation, the first inequality uses the convexity of $\phi(f, y)$ with respect to $f$, which implies $\phi(f_1, y) - \phi(f_2, y) - \phi'_1(f_2, y)(f_1 - f_2) \ge 0$; the second inequality uses Lemma 2.1; and the third inequality uses the Lipschitz assumption on the loss.

Now, summing over $k$ and considering any fixed $f \in \mathcal{F}_{\alpha,\beta,p}$, we obtain:
$$\begin{aligned}
\sum_{k=1}^{n+1} \phi(f_{\gamma,C}(\hat w^{(k)}, x_k), y_k)
&\le \sum_{k=1}^{n+1} \Big[ \phi(f_{\gamma,C}(\bar w, x_k), y_k) + \frac{B^2}{2\lambda n} \|x_k\|_\gamma^2 \Big] \\
&\le \sum_{k=1}^{n+1} \phi\Big( \sum_{v \in C} \gamma_v(x_k) f(v), y_k \Big) + n\lambda \sum_{v \in C} f(v)^2 + \frac{B^2}{2\lambda n} \sum_{k=1}^{n+1} \|x_k\|_\gamma^2 \\
&\le \sum_{k=1}^{n+1} \big[ \phi(f(x_k), y_k) + B \, Q(x_k) \big] + n\lambda \sum_{v \in C} f(v)^2 + \frac{B^2}{2\lambda n} \sum_{k=1}^{n+1} \|x_k\|_\gamma^2,
\end{aligned}$$
where $Q(x) = \alpha \|x - \gamma(x)\| + \beta \sum_{v \in C} |\gamma_v(x)| \|v - \gamma(x)\|^{1+p}$. In the above derivation, the second inequality follows from the definition of $\bar w$ as the minimizer of (5), and the third inequality follows from Lemma 1.1 together with the Lipschitz bound $|\phi'_1| \le B$. Now, taking expectation with respect to $S$ and noting that each summand on the left-hand side has the same distribution as the case $k = n + 1$, we obtain
$$(n+1) \, \mathbf{E}_S \, \phi(f_{\gamma,C}(\hat w^{(n+1)}, x_{n+1}), y_{n+1}) \le (n+1) \, \mathbf{E}_{x,y} \, \phi(f(x), y) + (n+1) B \, Q_{\alpha,\beta,p}(\gamma, C) + n\lambda \sum_{v \in C} f(v)^2 + \frac{B^2 (n+1)}{2\lambda n} \mathbf{E}_x \|x\|_\gamma^2.$$
This implies the desired bound after dividing by $n + 1$.
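The stability inequality of Lemma 2.1 is easy to check numerically for a concrete loss. The sketch below is an illustration under assumed choices (random codings, squared loss $\phi(p, y) = (p - y)^2$ so that $\phi'_1(p, y) = 2(p - y)$, and $g \equiv 0$): it fits $\bar w$ on all $n + 1$ examples as in (5), fits each leave-one-out solution $\hat w^{(k)}$, and verifies the inequality at $x_k$.

```python
import numpy as np

def ridge(Z, y, lam, n):
    """argmin_w (1/n) sum_i (w^T z_i - y_i)^2 + lam ||w||^2 (fixed 1/n)."""
    m = Z.shape[1]
    return np.linalg.solve(Z.T @ Z / n + lam * np.eye(m), Z.T @ y / n)

rng = np.random.default_rng(4)
n, m, lam = 30, 5, 0.1
Z = rng.normal(size=(n + 1, m))            # rows: codings [gamma_v(x_i)]
y = rng.normal(size=n + 1)

w_bar = ridge(Z, y, lam, n)                # trained on all n+1 examples, cf. (5)
for k in range(n + 1):
    mask = np.arange(n + 1) != k
    w_k = ridge(Z[mask], y[mask], lam, n)  # leave example k out, same 1/n factor
    z_k = Z[k]
    lhs = abs(z_k @ w_k - z_k @ w_bar)
    # phi(p, y) = (p - y)^2  =>  phi_1'(p, y) = 2 (p - y)
    rhs = (z_k @ z_k) / (2 * lam * n) * abs(2 * (z_k @ w_bar - y[k]))
    assert lhs <= rhs + 1e-9               # Lemma 2.1 at example k
print("Lemma 2.1 verified on all", n + 1, "leave-one-out splits")
```

For the squared loss the check is not merely empirical: the standard leave-one-out formula together with $Z^\top Z / n + \lambda I \succeq \lambda I$ implies the inequality exactly, which is the content of the lemma from [1].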
2.5 Proof of Theorem 1.3

Note that any measurable function $f : M \to \mathbb{R}$ can be approximated by $\mathcal{F}_{\alpha,\beta,p}$ with $\alpha, \beta \to \infty$ and $p = 0$. Therefore we only need to show
$$\lim_{n \to \infty} \mathbf{E}_{S_n} \mathbf{E}_{x,y} \, \phi(f_{\gamma,C}(\hat w, x), y) = \lim_{n \to \infty} \inf_{f \in \mathcal{F}_{\alpha,\beta,0}} \mathbf{E}_{x,y} \, \phi(f(x), y).$$
Theorem 1.1 implies that it is possible to pick $(\gamma, C)$ such that $|C|/n \to 0$ and $Q_{\alpha,\beta,p}(\gamma, C) \to 0$; moreover, $\|x\|_\gamma$ is bounded. Given any $f \in \mathcal{F}_{\alpha,\beta,0}$ and any fixed $A > 0$ independent of $n$, if we let $f_A(x) = \max(\min(f(x), A), -A)$, then it is clear that $f_A \in \mathcal{F}_{\alpha, \alpha+\beta, 0}$. Therefore Theorem 1.2 implies that as $n \to \infty$:
$$\mathbf{E}_{S_n} \mathbf{E}_{x,y} \, \phi(f_{\gamma,C}(\hat w, x), y) \le \mathbf{E}_{x,y} \, \phi(f_A(x), y) + o(1).$$
Since $A$ is arbitrary, we let $A \to \infty$ to obtain the desired result.

References

[1] Tong Zhang. Leave-one-out bounds for kernel methods. Neural Computation, 15:1397-1437, 2003.