Technical Proofs for Nonlinear Learning using Local Coordinate Coding

1 Notations and Main Results

Definition 1.1 (Lipschitz Smoothness) A function $f(x)$ on $\mathbb{R}^d$ is $(\alpha, \beta, p)$-Lipschitz smooth with respect to a norm $\|\cdot\|$ if
\[ |f(x') - f(x)| \le \alpha \|x - x'\| \]
and
\[ |f(x') - f(x) - \nabla f(x)^\top (x' - x)| \le \beta \|x - x'\|^{1+p}, \]
where we assume $\alpha, \beta > 0$ and $p \in (0, 1]$.

Definition 1.2 (Coordinate Coding) A coordinate coding is a pair $(\gamma, C)$, where $C \subset \mathbb{R}^d$ is a set of anchor points, and $\gamma$ is a map of $x \in \mathbb{R}^d$ to $[\gamma_v(x)]_{v \in C} \in \mathbb{R}^{|C|}$ such that $\sum_{v \in C} \gamma_v(x) = 1$. It induces the following physical approximation of $x$ in $\mathbb{R}^d$:
\[ \gamma(x) = \sum_{v \in C} \gamma_v(x)\, v. \]
Moreover, for all $x \in \mathbb{R}^d$, we define the coding norm as
\[ \|x\|_\gamma = \Big( \sum_{v \in C} \gamma_v(x)^2 \Big)^{1/2}. \]

Proposition 1.1 The map $x \mapsto \sum_{v \in C} \gamma_v(x)\, v$ is invariant under any shift of the origin for representing data points in $\mathbb{R}^d$ if and only if $\sum_{v \in C} \gamma_v(x) = 1$.

Lemma 1.1 (Linearization) Let $(\gamma, C)$ be an arbitrary coordinate coding on $\mathbb{R}^d$, and let $f$ be an $(\alpha, \beta, p)$-Lipschitz smooth function. We have for all $x \in \mathbb{R}^d$:
\[ \Big| f(x) - \sum_{v \in C} \gamma_v(x) f(v) \Big| \le \alpha \|x - \gamma(x)\| + \beta \sum_{v \in C} |\gamma_v(x)| \, \|v - \gamma(x)\|^{1+p}. \]

Definition 1.3 (Localization Measure) Given $\alpha, \beta, p$, and coding $(\gamma, C)$, we define
\[ Q_{\alpha,\beta,p}(\gamma, C) = \mathbb{E}_x \Big[ \alpha \|x - \gamma(x)\| + \beta \sum_{v \in C} |\gamma_v(x)| \, \|v - \gamma(x)\|^{1+p} \Big]. \]
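As a concrete toy illustration of Definitions 1.2 and 1.3, the following NumPy sketch builds one possible coding for a single point and evaluates the physical approximation $\gamma(x)$, the coding norm $\|x\|_\gamma$, and the localization term inside the expectation of Definition 1.3. The anchor set and the particular coding rule (a least-squares fit constrained to sum to one) are made-up choices for this example; the definitions only require $\sum_{v \in C} \gamma_v(x) = 1$.

import numpy as np

# Made-up anchor set C (rows) and a query point x in R^2.
C = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
x = np.array([0.3, 0.4])

# One possible coding: minimize ||x - sum_v gamma_v v||^2 subject to sum_v gamma_v = 1,
# solved via its KKT linear system.
k = len(C)
kkt = np.block([[C @ C.T, np.ones((k, 1))], [np.ones((1, k)), np.zeros((1, 1))]])
rhs = np.concatenate([C @ x, [1.0]])
gamma = np.linalg.solve(kkt, rhs)[:k]         # coefficients gamma_v(x), summing to one

gx = C.T @ gamma                              # physical approximation gamma(x)
coding_norm = np.sqrt(np.sum(gamma ** 2))     # ||x||_gamma
alpha, beta, p = 1.0, 1.0, 1.0
localization = alpha * np.linalg.norm(x - gx) + beta * np.sum(
    np.abs(gamma) * np.linalg.norm(C - gx, axis=1) ** (1 + p))
print(gamma, gx, coding_norm, localization)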

Definition 1.4 (Manifold) A subset $M \subset \mathbb{R}^d$ is called a $p$-smooth ($p > 0$) manifold with intrinsic dimensionality $m = m(M)$ if there exists a constant $c_p(M)$ such that given any $x \in M$, there exist $m$ vectors $v_1(x), \ldots, v_m(x) \in \mathbb{R}^d$ so that for all $x' \in M$:
\[ \inf_{\gamma \in \mathbb{R}^m} \Big\| x' - x - \sum_{j=1}^m \gamma_j v_j(x) \Big\| \le c_p(M) \, \|x' - x\|^{1+p}. \]

Definition 1.5 (Covering Number) Given any subset $M \subset \mathbb{R}^d$ and $\epsilon > 0$, the covering number, denoted $N(\epsilon, M)$, is the smallest cardinality of an $\epsilon$-cover $C \subset M$; that is, $\sup_{x \in M} \inf_{v \in C} \|x - v\| \le \epsilon$.

Theorem 1.1 (Manifold Coding) Suppose the data points $x$ lie on a compact $p$-smooth manifold $M$, and the norm is defined as $\|x\| = (x^\top A x)^{1/2}$ for some positive definite matrix $A$. Then given any $\epsilon > 0$, there exist anchor points $C \subset M$ and coding $\gamma$ such that
\[ |C| \le (1 + m(M)) \, N(\epsilon, M), \qquad Q_{\alpha,\beta,p}(\gamma, C) \le \big[ \alpha c_p(M) + (1 + \sqrt{m} + 2^{1+p}\sqrt{m}) \beta \big] \epsilon^{1+p}. \]
Moreover, for all $x \in M$, we have $\|x\|_\gamma^2 \le 1 + (1 + \sqrt{m})^2$.

Given a local-coordinate coding scheme $(\gamma, C)$, we approximate each $f(x) \in \mathcal{F}_{\alpha,\beta,p}$ by $f(x) \approx f_{\gamma,C}(\hat{w}, x) = \sum_{v \in C} \hat{w}_v \gamma_v(x)$, where we estimate the coefficients using ridge regression as:
\[ [\hat{w}_v] = \arg\min_{[w_v]} \Big[ \sum_{i=1}^n \phi\big(f_{\gamma,C}(w, x_i), y_i\big) + \lambda \sum_{v \in C} (w_v - g(v))^2 \Big]. \tag{1} \]

Theorem 1.2 (Generalization Bound) Suppose $\phi(p, y)$ is Lipschitz: $|\phi_1'(p, y)| \le B$. Consider coordinate coding $(\gamma, C)$ and the estimation method (1) with random training examples $S_n = \{(x_1, y_1), \ldots, (x_n, y_n)\}$. Then the expected generalization error satisfies the inequality:
\[ \mathbb{E}_{S_n} \mathbb{E}_{x,y}\, \phi\big(f_{\gamma,C}(\hat{w}, x), y\big) \le \inf_{f \in \mathcal{F}_{\alpha,\beta,p}} \Big[ \mathbb{E}_{x,y}\, \phi\big(f(x), y\big) + \lambda \sum_{v \in C} (f(v) - g(v))^2 \Big] + \frac{B^2}{2\lambda n} \mathbb{E}_x \|x\|_\gamma^2 + B\, Q_{\alpha,\beta,p}(\gamma, C). \]

Theorem 1.3 (Consistency) Suppose the data lie on a compact manifold $M \subset \mathbb{R}^d$, the norm is the Euclidean norm in $\mathbb{R}^d$, and the loss function $\phi(p, y)$ is Lipschitz. As $n \to \infty$, we choose $\alpha, \beta \to \infty$ with $\alpha/n, \beta/n \to 0$ ($\alpha, \beta$ depend on $n$), and $p = 0$. Then it is possible to find coding $(\gamma, C)$ using unlabeled data such that $|C|/n \to 0$ and $Q_{\alpha,\beta,p}(\gamma, C) \to 0$. If we pick $\lambda$ such that $\lambda n \to \infty$ and $\lambda |C| \to 0$, then the local coordinate coding method (1) with $g(v) \equiv 0$ is consistent as $n \to \infty$:
\[ \lim_{n \to \infty} \mathbb{E}_{S_n} \mathbb{E}_{x,y}\, \phi\big(f_{\gamma,C}(\hat{w}, x), y\big) = \inf_{f : M \to \mathbb{R}} \mathbb{E}_{x,y}\, \phi\big(f(x), y\big). \]
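To make the estimator (1) concrete, here is a minimal sketch of how it can be computed in the special case of the squared loss $\phi(p, y) = (p - y)^2$ and $g(v) \equiv 0$; this choice of loss and all the toy data below are illustrative assumptions, not part of the theorem statements. With the squared loss, the minimization reduces to a standard ridge-regression solve on the precomputed coding features $\gamma_v(x_i)$.

import numpy as np

def fit_lcc_ridge(G, y, lam):
    """Solve (1) for the squared loss and g(v) = 0.

    G   : (n, |C|) matrix with G[i, v] = gamma_v(x_i), the coding of x_i.
    y   : (n,) vector of targets.
    lam : the ridge parameter lambda.
    Returns w_hat such that f_{gamma,C}(w_hat, x) = sum_v w_hat[v] * gamma_v(x).
    """
    k = G.shape[1]
    # Normal equations of  sum_i (G[i] @ w - y[i])^2 + lam * ||w||^2.
    return np.linalg.solve(G.T @ G + lam * np.eye(k), G.T @ y)

# Toy usage with made-up codings and targets.
rng = np.random.default_rng(0)
G = rng.dirichlet(np.ones(5), size=50)      # rows sum to one, like [gamma_v(x_i)]
y = G @ np.arange(5.0) + 0.1 * rng.standard_normal(50)
w_hat = fit_lcc_ridge(G, y, lam=0.1)
print(w_hat)                                # roughly recovers the weights [0, 1, 2, 3, 4]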

2 Proofs

2.1 Proof of Proposition 1.1

Consider a change of the $\mathbb{R}^d$ origin by $u \in \mathbb{R}^d$, which shifts any point $x \in \mathbb{R}^d$ to $x + u$, and points $v \in C$ to $v + u$. The shift-invariance requirement implies that after the change, we map $x + u$ to $\sum_{v \in C} \gamma_v(x) v + u$, which should equal $\sum_{v \in C} \gamma_v(x)(v + u)$. This is equivalent to $u = \sum_{v \in C} \gamma_v(x) u$, which holds if and only if $\sum_{v \in C} \gamma_v(x) = 1$.

2.2 Proof of Lemma 1.1

For simplicity, let $\gamma_v = \gamma_v(x)$ and $x' = \gamma(x) = \sum_{v \in C} \gamma_v v$. We have
\begin{align*}
\Big| f(x) - \sum_{v \in C} \gamma_v f(v) \Big|
&\le |f(x) - f(x')| + \Big| \sum_{v \in C} \gamma_v \big(f(v) - f(x')\big) \Big| \\
&= |f(x) - f(x')| + \Big| \sum_{v \in C} \gamma_v \big(f(v) - f(x') - \nabla f(x')^\top (v - x')\big) \Big| \\
&\le \alpha \|x - x'\| + \beta \sum_{v \in C} |\gamma_v| \, \|x' - v\|^{1+p},
\end{align*}
where the equality uses $\sum_{v \in C} \gamma_v \nabla f(x')^\top (v - x') = \nabla f(x')^\top (\gamma(x) - x') = 0$. This implies the bound.

2.3 Proof of Theorem 1.1

Let $m = m(M)$. Given any $\epsilon > 0$, consider an $\epsilon$-cover $C'$ of $M$ with $|C'| \le N(\epsilon, M)$. For each $u \in C'$, define $C_u = \{v_1(u), \ldots, v_m(u)\}$, where the $v_j(u)$ are as in Definition 1.4. Define the anchor points as
\[ C = \bigcup_{u \in C'} \{ u + v_j(u) : j = 1, \ldots, m \} \cup C'. \]
It follows that $|C| \le (1 + m) N(\epsilon, M)$. In the following, we only need to prove the existence of a coding $\gamma$ on $M$ that satisfies the requirement of the theorem. Without loss of generality, we assume that $\|v_j(u)\| = \epsilon$ for each $u$ and $j$, and that, given $u$, the vectors $\{v_j(u) : j = 1, \ldots, m\}$ are orthogonal with respect to $A$: $v_j(u)^\top A\, v_k(u) = 0$ when $j \ne k$.

For each $x \in M$, let $u_x \in C'$ be the closest point to $x$ in $C'$. We have $\|x - u_x\| \le \epsilon$ by the definition of $C'$. Now, Definition 1.4 implies that there exist $\gamma_j(x)$ ($j = 1, \ldots, m$) such that
\[ \Big\| x - u_x - \sum_{j=1}^m \gamma_j(x) v_j(u_x) \Big\| \le c_p(M)\, \epsilon^{1+p}. \]
The optimal choice is the $A$-projection of $x - u_x$ onto the subspace spanned by $\{v_j(u_x) : j = 1, \ldots, m\}$. The orthogonality condition thus implies that
\[ \sum_{j=1}^m \gamma_j(x)^2 \|v_j(u_x)\|^2 \le \|x - u_x\|^2 \le \epsilon^2. \]
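Before continuing, here is a small numerical sanity check of the projection step just described, as an aside. It is a minimal sketch with made-up toy values: the metric $A$, the radius $\epsilon$, the tangent directions $v_j(u_x)$, and the point $x$ are all illustrative assumptions, not taken from the paper.

import numpy as np

A = np.diag([2.0, 1.0, 0.5])          # positive definite metric: ||z||^2 = z^T A z
eps = 0.1

def a_norm(z):
    return np.sqrt(z @ A @ z)

u_x = np.zeros(3)                     # nearest cover point (toy choice)
# Two A-orthogonal tangent directions, rescaled so that ||v_j(u_x)|| = eps.
v1 = np.array([1.0, 0.0, 0.0])
v1 = v1 * (eps / a_norm(v1))
v2 = np.array([0.0, 1.0, 0.0])
v2 = v2 * (eps / a_norm(v2))
V = np.stack([v1, v2])                # rows are v_j(u_x)

x = u_x + np.array([0.03, -0.02, 0.04])
assert a_norm(x - u_x) <= eps         # x lies within the eps-ball of its cover point

# A-projection coefficients: gamma_j = v_j^T A (x - u_x) / ||v_j||^2, with ||v_j|| = eps.
gamma = V @ A @ (x - u_x) / eps ** 2
lhs = np.sum(gamma ** 2) * eps ** 2   # sum_j gamma_j(x)^2 ||v_j(u_x)||^2
print(gamma)
print(lhs <= a_norm(x - u_x) ** 2 + 1e-12)   # <= ||x - u_x||^2
print(np.sum(gamma ** 2) <= 1.0)             # hence sum_j gamma_j(x)^2 <= 1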

Therefore
\[ \sum_{j=1}^m \gamma_j(x)^2 \le 1, \]
which implies that for all $x$:
\[ \sum_{j=1}^m |\gamma_j(x)| \le \sqrt{m}. \]
We can now define the coordinate coding of $x \in M$ as
\[ \gamma_v(x) = \begin{cases} \gamma_j(x) & v = u_x + v_j(u_x),\ j = 1, \ldots, m \\ 1 - \sum_{j=1}^m \gamma_j(x) & v = u_x \\ 0 & \text{otherwise.} \end{cases} \tag{2} \]
This implies the following bounds:
\[ \|x - \gamma(x)\| \le c_p(M)\, \epsilon^{1+p} \]
and
\begin{align}
\sum_{v \in C} |\gamma_v(x)| \, \|v - \gamma(x)\|^{1+p}
&= |\gamma_{u_x}(x)| \, \|\gamma(x) - u_x\|^{1+p} + \sum_{j=1}^m |\gamma_j(x)| \, \big\| v_j(u_x) - (\gamma(x) - u_x) \big\|^{1+p} \notag \\
&\le (1 + \sqrt{m})\, \epsilon^{1+p} + \sum_{j=1}^m |\gamma_j(x)| \, (\epsilon + \epsilon)^{1+p} \tag{3} \\
&\le \big[ 1 + \sqrt{m} + 2^{1+p}\sqrt{m} \big] \epsilon^{1+p}, \tag{4}
\end{align}
where we have used $\|v_j(u_x)\| = \epsilon$, $\|\gamma(x) - u_x\| \le \|x - u_x\| \le \epsilon$, and $|\gamma_{u_x}(x)| \le 1 + \sum_{j=1}^m |\gamma_j(x)| \le 1 + \sqrt{m}$. Moreover, $\|x\|_\gamma^2 = \gamma_{u_x}(x)^2 + \sum_{j=1}^m \gamma_j(x)^2 \le (1 + \sqrt{m})^2 + 1$, which completes the proof.

2.4 Proof of Theorem 1.2

Consider $n + 1$ samples $S = \{(x_1, y_1), \ldots, (x_{n+1}, y_{n+1})\}$. We shall introduce the following notation:
\[ [\bar{w}_v] = \arg\min_{[w_v]} \Big[ \sum_{i=1}^{n+1} \phi\big(f_{\gamma,C}(w, x_i), y_i\big) + \lambda \sum_{v \in C} w_v^2 \Big], \tag{5} \]
where we take $g(v) \equiv 0$ without loss of generality (the general case follows by replacing $w_v$ with $w_v - g(v)$). Let $k$ be an integer randomly drawn from $\{1, \ldots, n+1\}$, and let $[\hat{w}^{(k)}_v]$ be the solution of
\[ [\hat{w}^{(k)}_v] = \arg\min_{[w_v]} \Big[ \sum_{i=1,\ldots,n+1;\ i \ne k} \phi\big(f_{\gamma,C}(w, x_i), y_i\big) + \lambda \sum_{v \in C} w_v^2 \Big], \]
that is, with the $k$-th example left out. We have the following stability lemma from [1], which can be stated as follows using our terminology:

Lemma 2.1 The following inequality holds:
\[ \big| f_{\gamma,C}(\hat{w}^{(k)}, x_k) - f_{\gamma,C}(\bar{w}, x_k) \big| \le \frac{\|x_k\|_\gamma^2}{2\lambda n} \big| \phi_1'\big(f_{\gamma,C}(\bar{w}, x_k), y_k\big) \big|. \]

By using Lemma 2.1, we obtain for each $k$:
\begin{align*}
\phi\big(f_{\gamma,C}(\hat{w}^{(k)}, x_k), y_k\big) - \phi\big(f_{\gamma,C}(\bar{w}, x_k), y_k\big)
&\le -\phi_1'\big(f_{\gamma,C}(\hat{w}^{(k)}, x_k), y_k\big) \big( f_{\gamma,C}(\bar{w}, x_k) - f_{\gamma,C}(\hat{w}^{(k)}, x_k) \big) \\
&\le \big| \phi_1'\big(f_{\gamma,C}(\hat{w}^{(k)}, x_k), y_k\big) \big| \, \big| \phi_1'\big(f_{\gamma,C}(\bar{w}, x_k), y_k\big) \big| \, \|x_k\|_\gamma^2 / (2\lambda n) \\
&\le B^2 \|x_k\|_\gamma^2 / (2\lambda n).
\end{align*}
In the above derivation, the first inequality uses the convexity of $\phi(f, y)$ with respect to $f$, which implies that $\phi(f_1, y) - \phi(f_2, y) - \phi_1'(f_2, y)(f_1 - f_2) \ge 0$. The second inequality uses Lemma 2.1, and the third inequality uses the Lipschitz assumption on the loss function.

Now, summing over $k$ and considering any fixed $f \in \mathcal{F}_{\alpha,\beta,p}$, we obtain:
\begin{align*}
\sum_{k=1}^{n+1} \phi\big(f_{\gamma,C}(\hat{w}^{(k)}, x_k), y_k\big)
&\le \sum_{k=1}^{n+1} \Big[ \phi\big(f_{\gamma,C}(\bar{w}, x_k), y_k\big) + \frac{B^2}{2\lambda n} \|x_k\|_\gamma^2 \Big] \\
&\le \sum_{k=1}^{n+1} \phi\Big( \sum_{v \in C} \gamma_v(x_k) f(v), y_k \Big) + \lambda \sum_{v \in C} f(v)^2 + \frac{B^2}{2\lambda n} \sum_{k=1}^{n+1} \|x_k\|_\gamma^2 \\
&\le \sum_{k=1}^{n+1} \big[ \phi\big(f(x_k), y_k\big) + B\, Q(x_k) \big] + \lambda \sum_{v \in C} f(v)^2 + \frac{B^2}{2\lambda n} \sum_{k=1}^{n+1} \|x_k\|_\gamma^2,
\end{align*}
where $Q(x) = \alpha \|x - \gamma(x)\| + \beta \sum_{v \in C} |\gamma_v(x)| \, \|v - \gamma(x)\|^{1+p}$. In the above derivation, the second inequality follows from the definition of $\bar{w}$ as the minimizer of (5), and the third inequality follows from Lemma 1.1 together with the Lipschitz assumption on $\phi$.

Now, taking expectation with respect to $S$, we obtain
\[ (n+1)\, \mathbb{E}_S\, \phi\big(f_{\gamma,C}(\hat{w}^{(n+1)}, x_{n+1}), y_{n+1}\big) \le (n+1)\, \mathbb{E}_{x,y}\, \phi\big(f(x), y\big) + (n+1)\, B\, Q_{\alpha,\beta,p}(\gamma, C) + \lambda \sum_{v \in C} f(v)^2 + \frac{B^2 (n+1)}{2\lambda n}\, \mathbb{E}_x \|x\|_\gamma^2. \]
Dividing by $n+1$ and noting that $\hat{w}^{(n+1)}$ is exactly the estimator (1) trained on the first $n$ examples, this implies the desired bound.

2.5 Proof of Theorem 1.3

Note that any measurable function $f : M \to \mathbb{R}$ can be approximated by functions in $\mathcal{F}_{\alpha,\beta,p}$ with $\alpha, \beta \to \infty$ and $p = 0$. Therefore we only need to show
\[ \lim_{n \to \infty} \mathbb{E}_{S_n} \mathbb{E}_{x,y}\, \phi\big(f_{\gamma,C}(\hat{w}, x), y\big) = \lim_{n \to \infty} \inf_{f \in \mathcal{F}_{\alpha,\beta,p}} \mathbb{E}_{x,y}\, \phi\big(f(x), y\big). \]

Theorem 1.1 implies that it is possible to pick $(\gamma, C)$ such that $|C|/n \to 0$ and $Q_{\alpha,\beta,p}(\gamma, C) \to 0$; moreover, $\|x\|_\gamma$ is bounded. Given any $f \in \mathcal{F}_{\alpha,\beta,0}$ and any fixed $A > 0$ independent of $n$, if we let $f_A(x) = \max(\min(f(x), A), -A)$, then it is clear that $f_A(x) \in \mathcal{F}_{\alpha,\alpha+\beta,0}$. Therefore Theorem 1.2 implies that as $n \to \infty$,
\[ \mathbb{E}_{S_n} \mathbb{E}_{x,y}\, \phi\big(f_{\gamma,C}(\hat{w}, x), y\big) \le \mathbb{E}_{x,y}\, \phi\big(f_A(x), y\big) + o(1). \]
Since $A$ is arbitrary, we let $A \to \infty$ to obtain the desired result.

References

[1] Tong Zhang. Leave-one-out bounds for kernel methods. Neural Computation, 15:1397-1437, 2003.