Dimensionality reduction in Hilbert spaces


Maxim Raginsky
October 3, 2014

Dimensionality reduction is a generic name for any procedure that takes a complicated object living in a high-dimensional (or possibly even infinite-dimensional) space and approximates it in some sense by a finite-dimensional vector. We are interested in a particular class of dimensionality reduction methods. Consider a data source that generates vectors in some Hilbert space $H$, which is either infinite-dimensional or has a finite but extremely large dimension (think $\mathbb{R}^d$ with the usual Euclidean norm, where $d$ is huge). We will assume that the vectors of interest lie in the unit ball of $H$,

$$ B(H) := \{ x \in H : \|x\| \le 1 \}, $$

where $\|x\| = \sqrt{\langle x, x \rangle}$ is the norm on $H$. We wish to represent each $x \in B(H)$ by a vector $\hat{y} \in \mathbb{R}^k$ for some fixed $k$ (if $H$ is $d$-dimensional, then of course we must have $k \le d$). For instance, $k$ may represent some storage limitation, such as a device that can store no more than $k$ real numbers (or, more realistically, $k$ double-precision floating-point numbers, which for all practical purposes can be thought of as real numbers). The mapping $x \mapsto \hat{y}$ can be thought of as an encoding rule. In addition, given $\hat{y} \in \mathbb{R}^k$, we need a decoding rule that takes $\hat{y}$ and outputs a vector $\tilde{x} \in H$ that will serve as an approximation of $x$. In general, the cascade of mappings

$$ x \xrightarrow{\ \text{encoding}\ } \hat{y} \xrightarrow{\ \text{decoding}\ } \tilde{x} $$

will be lossy, i.e., $\tilde{x} \neq x$. So, the goal is to ensure that the squared norm error $\|x - \tilde{x}\|^2$ is as small as possible. In this lecture, we will see how Rademacher complexity techniques can be used to characterize the performance of a fairly broad class of dimensionality reduction schemes in Hilbert spaces. Our exposition here is based on a beautiful recent paper of Maurer and Pontil [MP10]. We will consider a particular type of dimensionality reduction scheme, where the encoder is a (nonlinear) projection, whereas the decoder is a linear operator from $\mathbb{R}^k$ into $H$ (the Appendix contains some basic facts pertaining to linear operators between Hilbert spaces).
To specify such a scheme, we fix a pair $(Y, T)$ consisting of a closed set $Y \subseteq \mathbb{R}^k$ and a linear operator $T : \mathbb{R}^k \to H$. We call $Y$ the codebook and use the encoding rule

$$ \hat{y} = \operatorname*{argmin}_{y \in Y} \|x - T y\|. \tag{1} $$

Unless $Y$ is a closed subspace of $\mathbb{R}^k$, this encoding map will be nonlinear. The decoding, on the other hand, is linear: $\tilde{x} = T \hat{y}$. With these definitions, the reconstruction error is given by

$$ \|x - \tilde{x}\|^2 = f_T(x) := \min_{y \in Y} \|x - T y\|^2. $$
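To make the encode/decode cascade concrete, here is a minimal numerical sketch with $H = \mathbb{R}^d$ and a finite random set standing in for the codebook $Y$. All dimensions and names (`encode`, `decode`) are illustrative choices, not anything prescribed by the text or by [MP10].

```python
import numpy as np

def encode(x, T, Y):
    # nonlinear encoder (1): pick the codeword y in Y minimizing ||x - T y||
    errs = np.linalg.norm(x[None, :] - Y @ T.T, axis=1)
    return Y[int(np.argmin(errs))]

def decode(y, T):
    # linear decoder: x_tilde = T y
    return T @ y

rng = np.random.default_rng(0)
d, k = 8, 3                              # hypothetical dimensions
T = rng.standard_normal((d, k))          # a linear decoding map R^k -> H
Y = rng.standard_normal((50, k))         # a finite random codebook
x = rng.standard_normal(d)
x /= np.linalg.norm(x)                   # put x in the unit ball B(H)

y_hat = encode(x, T, Y)
x_tilde = decode(y_hat, T)
err = float(np.linalg.norm(x - x_tilde) ** 2)   # reconstruction error f_T(x)
```

Note that the encoder is a brute-force minimization over the finite codebook, which is exactly why it is nonlinear in $x$ unless $Y$ is a subspace.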

Now suppose that the input to our dimensionality reduction scheme is a random vector $X \in B(H)$ with some unknown distribution $P$. Then we measure the performance of the coding scheme $(Y, T)$ by its expected reconstruction error

$$ L(T) := \mathbb{E}_P[f_T(X)] = \mathbb{E}_P\Big[ \min_{y \in Y} \|X - T y\|^2 \Big] $$

(note that, even though the reconstruction error depends on the codebook $Y$, we do not explicitly indicate this dependence, since the choice of $Y$ will be fixed by a particular application). Now let $\mathcal{T}$ be some fixed class of admissible linear decoding maps $T : \mathbb{R}^k \to H$. If we knew $P$, we could find the best decoder $T^* \in \mathcal{T}$ that achieves

$$ L^*(\mathcal{T}) := \inf_{T \in \mathcal{T}} L(T) $$

(assuming, of course, that the infimum exists and is achieved by at least one $T \in \mathcal{T}$). By now, you know the drill: we don't know $P$, but we have access to a large set of samples $X_1, \ldots, X_n$ drawn i.i.d. from $P$. So we attempt to learn $T^*$ via ERM:

$$ \hat{T}_n := \operatorname*{argmin}_{T \in \mathcal{T}} \frac{1}{n} \sum_{i=1}^n f_T(X_i) = \operatorname*{argmin}_{T \in \mathcal{T}} \frac{1}{n} \sum_{i=1}^n \min_{y \in Y} \|X_i - T y\|^2. $$

Our goal is to establish the following result:

Theorem 1. Assume that $Y$ is a closed subset of the unit ball $B^k := \{ y \in \mathbb{R}^k : \|y\| \le 1 \}$, and that every $T \in \mathcal{T}$ satisfies

$$ \max_{1 \le j \le k} \|T e_j\| \le \alpha, \qquad \|T\|_Y := \sup_{y \in Y} \|T y\| \le \alpha $$

for some finite $\alpha \ge 1$, where $e_1, \ldots, e_k$ is the standard basis of $\mathbb{R}^k$. Then

$$ L(\hat{T}_n) \le L^*(\mathcal{T}) + \frac{60 \alpha^2 k^2}{\sqrt{n}} + 4 \alpha^2 \sqrt{\frac{2 \log(1/\delta)}{n}} \tag{2} $$

with probability at least $1 - \delta$. In the special case when $Y = \{e_1, \ldots, e_k\}$, the standard basis in $\mathbb{R}^k$, the event

$$ L(\hat{T}_n) \le L^*(\mathcal{T}) + \frac{40 \alpha^2 k}{\sqrt{n}} + 4 \alpha^2 \sqrt{\frac{2 \log(1/\delta)}{n}} \tag{3} $$

holds with probability at least $1 - \delta$.

Remark 1. The above result is slightly weaker than the one from [MP10]; as a consequence, the constants in Eqs. (2) and (3) are slightly worse than they could otherwise be.

1 Examples

Before we get down to business and prove the theorem, let's look at a few examples.
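Before turning to specific choices of $(Y, \mathcal{T})$, here is a small numerical illustration of the ERM rule $\hat{T}_n$ itself, with the class $\mathcal{T}$ replaced by a toy finite collection of random decoders so that the argmin can be computed by enumeration. Everything here (dimensions, class size, codebook size) is a hypothetical stand-in for the abstract setup.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, n = 6, 2, 200

def f_T(x, T, Y):
    # reconstruction error f_T(x) = min_{y in Y} ||x - T y||^2
    return float(np.min(np.linalg.norm(x[None, :] - Y @ T.T, axis=1) ** 2))

# i.i.d. sample from a P supported on the unit ball (unknown to the learner)
X = rng.standard_normal((n, d))
X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)

# codebook: a finite sample of the unit ball B^k
Y = rng.standard_normal((30, k))
Y /= np.maximum(np.linalg.norm(Y, axis=1, keepdims=True), 1.0)

# a finite stand-in for the class of admissible decoders
Ts = [rng.standard_normal((d, k)) for _ in range(10)]

# ERM: pick the decoder with smallest empirical reconstruction error
emp_risks = [float(np.mean([f_T(x, T, Y) for x in X])) for T in Ts]
T_hat = Ts[int(np.argmin(emp_risks))]
```

Theorem 1 then quantifies how far the (population) risk of such an empirically chosen decoder can be from the best risk in the class.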

1.1 Principal component analysis (PCA)

The objective of PCA is, given $k$, to construct a projection $\Pi$ onto a $k$-dimensional closed subspace of $H$ that maximizes the average energy content of the projected vector:

$$ \text{maximize } \mathbb{E}\|\Pi X\|^2 \quad \text{subject to } \dim \Pi(H) \le k. \tag{4} $$

For any $x \in H$,

$$ \|\Pi x\|^2 = \|x\|^2 - \|(I - \Pi)x\|^2, \tag{5} $$

where $I$ is the identity operator on $H$. To prove (5), expand the right-hand side:

$$ \|x\|^2 - \|(I - \Pi)x\|^2 = \|x\|^2 - \|x\|^2 + 2\langle x, \Pi x \rangle - \|\Pi x\|^2 = 2\langle x, \Pi x \rangle - \|\Pi x\|^2 = \|\Pi x\|^2, $$

where the last step is by the properties of projections ($\langle x, \Pi x \rangle = \langle \Pi x, \Pi x \rangle$). Thus,

$$ \|\Pi x\|^2 = \|x\|^2 - \|x - \Pi x\|^2 = \|x\|^2 - \min_{x' \in K} \|x - x'\|^2, \tag{6} $$

where $K$ is the range of $\Pi$ (the closure of the linear span of all vectors of the form $\Pi x$, $x \in H$). Moreover, any projection operator $\Pi : H \to K$ with $\dim K \le k$ can be factored as $T T^*$, where $T : \mathbb{R}^k \to H$ is an isometry (see Appendix for definitions and the proof of this fact). Using this fact, we can write

$$ K = \Pi(H) = \{ T y : y \in \mathbb{R}^k \}. $$

Using this in (6), we get

$$ \|\Pi x\|^2 = \|x\|^2 - \min_{y \in \mathbb{R}^k} \|x - T y\|^2. $$

Hence, solving the optimization problem (4) is equivalent to finding the best linear decoding map $T$ for the pair $(Y, \mathcal{T})$, where $Y = \mathbb{R}^k$ and $\mathcal{T}$ is the collection of all isometries $T : \mathbb{R}^k \to H$. Moreover, if we recall our assumption that $X \in B(H)$ with probability one, then we see that there is no loss of generality if we take

$$ Y = B^k = \{ y \in \mathbb{R}^k : \|y\| \le 1 \}, $$

i.e., the unit ball in $(\mathbb{R}^k, \|\cdot\|)$. This follows from the fact that $\|\Pi x\| \le \|x\|$ for any projection $\Pi$; in particular, for $\Pi = T T^*$ the encoding $\hat{y}$ in (1) can be written as $\hat{y} = T^* x$, and since $T$ is an isometry,

$$ \|\hat{y}\| = \|T \hat{y}\| = \|T T^* x\| = \|\Pi x\| \le \|x\| \le 1. $$

Thus, Theorem 1 applies with $\alpha = 1$. That said, there are much tighter bounds for PCA that rely on deeper structural results pertaining to finite-dimensional subspaces of Hilbert spaces, but that is beside the point. The key idea here is that we can already get nice bounds using the tools already at our fingertips.
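The PCA factorization $\Pi = T T^*$ is easy to exhibit numerically: the top-$k$ right singular vectors of the data matrix give an isometry $T$ (orthonormal columns), the encoder is $\hat{y} = T^* x$, and the decoder returns $\Pi x = T T^* x$. A sketch with toy dimensions (we skip the usual mean-centering to match the energy criterion (4)):

```python
import numpy as np

rng = np.random.default_rng(2)
d, k, n = 10, 3, 500

# toy data in H = R^d with some correlation structure
X = rng.standard_normal((n, d)) @ rng.standard_normal((d, d)) / np.sqrt(d)

# top-k right singular vectors: the columns of T are orthonormal,
# so T is an isometry from R^k into R^d and Pi = T T^* is the PCA projection
_, _, Vt = np.linalg.svd(X, full_matrices=False)
T = Vt[:k].T                     # shape (d, k)

x = X[0]
y_hat = T.T @ x                  # encoder: for isometries, argmin_y ||x - Ty|| = T^* x
x_hat = T @ y_hat                # decoder: x_hat = T T^* x = Pi x
```

The Pythagorean identity (5)-(6), $\|x - \Pi x\|^2 = \|x\|^2 - \|\Pi x\|^2$, can be checked directly on this example.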

1.2 Vector quantization, or k-means clustering

Vector quantization (or $k$-means clustering) is a procedure that takes a vector $x \in H$ and maps it to its nearest neighbor in a finite set $C = \{\xi_1, \ldots, \xi_k\} \subset H$, where $k$ is a given positive integer:

$$ \tilde{x} = \operatorname*{argmin}_{\xi \in C} \|x - \xi\|. $$

If $X$ is random with distribution $P$, then the optimal $k$-point quantizer is given by a size-$k$ set $C^* = \{\xi^*_1, \ldots, \xi^*_k\}$ that minimizes the reconstruction error

$$ \mathbb{E}_P\Big[ \min_{\xi \in C} \|X - \xi\|^2 \Big] $$

over all $C \subset H$ with $|C| \le k$. We can cast the problem of finding $C^*$ in our framework by taking $Y = \{e_1, \ldots, e_k\}$ (the standard basis in $\mathbb{R}^k$) and letting $\mathcal{T}$ be the set of all linear operators $T : \mathbb{R}^k \to H$. It is easy to see that any $C \subset H$ with $|C| \le k$ can be obtained as an image of the standard basis $\{e_1, \ldots, e_k\}$ under some linear operator $T : \mathbb{R}^k \to H$. Indeed, for any $C = \{\xi_1, \ldots, \xi_k\}$, we can just define a linear operator $T$ by $T e_j = \xi_j$, $1 \le j \le k$, and then extend it to all of $\mathbb{R}^k$ by linearity:

$$ T y = T\Big( \sum_{j=1}^k y_j e_j \Big) = \sum_{j=1}^k y_j T e_j = \sum_{j=1}^k y_j \xi_j. $$

So, another way to interpret the objective of vector quantization is as follows: given a distribution $P$ supported on $B(H)$, we seek a $k$-element set $C = \{\xi_1, \ldots, \xi_k\} \subset H$, such that the random vector $X \sim P$ can be well-approximated on average by linear combinations of the form $\sum_j y_j \xi_j$, where the vector of coefficients $y = (y_1, \ldots, y_k)$ can have only one nonzero component, which is furthermore required to be equal to $1$. In fact, there is no loss of generality in assuming that $C \subset B(H)$ as well. This is a consequence of the fact that, for any $x \in B(H)$ and any $x' \in H$, we can always find some $x'' \in B(H)$ such that $\|x - x''\| \le \|x - x'\|$. Indeed, it suffices to take $x'' = \operatorname{argmin}_{z \in B(H)} \|x' - z\|$, and it is not hard to show that this $x''$ is just $x' / \max\{1, \|x'\|\}$. Thus, Theorem 1 applies with $\alpha = 1$. Moreover, the excess risk grows linearly with the dimension $k$, cf. Eq. (3). It is not known whether this linear dependence on $k$ is optimal: there are $\Omega(\sqrt{k/n})$ lower bounds for vector quantization, but it is still an open question whether these lower bounds are tight [MP10].
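In this formulation the columns of $T$ are the codepoints $\xi_j = T e_j$, and the empirical risk is minimized in practice by Lloyd-type alternation: encode each sample to its nearest column, then move each column to the mean of its cell. A minimal sketch (plain Lloyd iterations with random initialization; all sizes are toy choices):

```python
import numpy as np

rng = np.random.default_rng(3)
d, k, n = 5, 4, 300

# sample points in the unit ball of H = R^d
X = rng.standard_normal((n, d))
X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)

def risk(X, C):
    # empirical reconstruction error: mean over i of min_j ||x_i - xi_j||^2
    d2 = np.linalg.norm(X[:, :, None] - C[None, :, :], axis=1) ** 2
    return float(np.mean(np.min(d2, axis=1)))

# columns of C play the role of T e_j = xi_j
C = X[rng.choice(n, size=k, replace=False)].T.copy()
risk0 = risk(X, C)

for _ in range(20):                          # plain Lloyd iterations
    d2 = np.linalg.norm(X[:, :, None] - C[None, :, :], axis=1) ** 2
    assign = np.argmin(d2, axis=1)           # encoder: y_hat = e_{assign[i]}
    for j in range(k):                       # decoder update: centroid of each cell
        if np.any(assign == j):
            C[:, j] = X[assign == j].mean(axis=0)

emp_risk = risk(X, C)
```

Each Lloyd step can only decrease the empirical risk, though of course it finds a local rather than global minimizer of the ERM objective.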

1.3 Nonnegative matrix factorization

Consider approximating the random vector $X \sim P$, where $P$ is supported on the unit ball $B(H)$, by linear combinations of the form $\sum_{j=1}^k y_j \xi_j$, where the real vector $y = (y_1, \ldots, y_k)$ is constrained to lie in the nonnegative orthant

$$ \mathbb{R}^k_+ := \{ y = (y_1, \ldots, y_k) \in \mathbb{R}^k : y_j \ge 0,\ 1 \le j \le k \}, $$

while the unit vectors $\xi_1, \ldots, \xi_k \in B(H)$ are constrained by the positivity condition

$$ \langle \xi_j, \xi_l \rangle \ge 0, \qquad 1 \le j, l \le k. $$

This is a generalization of the nonnegative matrix factorization (NMF) problem, originally posed by Lee and Seung [LS99]. To cast NMF in our framework, let $Y = \mathbb{R}^k_+$, and let $\mathcal{T}$ be the set of all linear operators $T : \mathbb{R}^k \to H$ such that (i) $\|T e_j\| = 1$ for all $1 \le j \le k$ and (ii) $\langle T e_j, T e_l \rangle \ge 0$ for all $1 \le j, l \le k$. Then the choice of $T$ is equivalent to the choice of $\xi_1, \ldots, \xi_k \in B(H)$, as above. Moreover, it can be shown that, for any $x \in B(H)$ and any $T \in \mathcal{T}$, the minimum of $\|x - T y\|$ over all $y \in \mathbb{R}^k_+$ is achieved at some $\hat{y} \in \mathbb{R}^k_+$ with $\|\hat{y}\| \le 1$. Thus, there is no loss of generality if we take $Y = \mathbb{R}^k_+ \cap B^k$. In this case, the conditions of Theorem 1 are satisfied with $\alpha = 1$.

1.4 Sparse coding

Take $Y$ to be the $\ell_1$ unit ball

$$ B^k_1 := \Big\{ y = (y_1, \ldots, y_k) \in \mathbb{R}^k : \|y\|_1 = \sum_{j=1}^k |y_j| \le 1 \Big\}, $$

and let $\mathcal{T}$ be the collection of all linear operators $T : \mathbb{R}^k \to H$ with $\|T e_j\| \le 1$ for all $1 \le j \le k$. In this case, the dimensionality reduction problem is to approximate a random $X \in B(H)$ by a linear combination of the form $\sum_j y_j \xi_j$, where $y = (y_1, \ldots, y_k) \in \mathbb{R}^k$ satisfies the constraint $\|y\|_1 \le 1$, while the vectors $\xi_1, \ldots, \xi_k$ belong to the unit ball $B(H)$. Then for any $y = \sum_{j=1}^k y_j e_j \in Y$ we have

$$ \|T y\| = \Big\| \sum_{j=1}^k y_j T e_j \Big\| \le \sum_{j=1}^k |y_j| \, \|T e_j\| \le \|y\|_1 \max_{1 \le j \le k} \|T e_j\| \le 1, $$

where the third step is by Hölder's inequality. Then the conditions of Theorem 1 are satisfied with $\alpha = 1$.
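For a fixed dictionary $T$, the sparse-coding encoder (1) is a convex program: minimize $\|x - T y\|^2$ over the $\ell_1$ ball. One simple way to solve it is projected gradient descent using the standard sort-based Euclidean projection onto the $\ell_1$ ball; the sketch below is one illustrative solver under toy dimensions, not a method taken from the text or from [MP10].

```python
import numpy as np

def project_l1(v, r=1.0):
    # Euclidean projection onto the l1 ball of radius r (sort-based algorithm)
    if np.abs(v).sum() <= r:
        return v
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, v.size + 1)
    rho = np.nonzero(u * idx > css - r)[0][-1]
    theta = (css[rho] - r) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def sparse_encode(x, T, iters=300):
    # projected gradient descent on f(y) = ||x - T y||^2 over the l1 ball,
    # started at y = 0 with step 1/L, where L bounds the gradient's Lipschitz constant
    L = 2.0 * np.linalg.norm(T, 2) ** 2 + 1e-12
    y = np.zeros(T.shape[1])
    for _ in range(iters):
        y = project_l1(y - (2.0 * T.T @ (T @ y - x)) / L)
    return y

rng = np.random.default_rng(6)
d, k = 12, 5
T = rng.standard_normal((d, k))
T /= np.linalg.norm(T, axis=0)        # normalize columns: ||T e_j|| = 1, so alpha = 1
x = rng.standard_normal(d)
x /= np.linalg.norm(x)                # x in the unit ball B(H)

y_hat = sparse_encode(x, T)
recon_err = float(np.linalg.norm(x - T @ y_hat) ** 2)
```

Since $y = 0$ is feasible and projected gradient descent with step $1/L$ is monotone, the reconstruction error can never exceed $\|x\|^2$, matching the bound $0 \le f_T(x) \le 4\alpha^2$ used later in the proof with lots of room to spare.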

2 Proof of Theorem 1

Now we turn to the proof of Theorem 1. The format of the proof is the familiar one: if we consider the empirical reconstruction error

$$ L_n(T) := \frac{1}{n} \sum_{i=1}^n f_T(X_i) = \frac{1}{n} \sum_{i=1}^n \min_{y \in Y} \|X_i - T y\|^2 $$

for every $T \in \mathcal{T}$ and define the uniform deviation

$$ \Delta_n(X^n) := \sup_{T \in \mathcal{T}} |L_n(T) - L(T)|, \tag{7} $$

then

$$ L(\hat{T}_n) \le L^*(\mathcal{T}) + 2 \Delta_n(X^n). $$

Now, for any $x \in B(H)$, any $y \in Y$, and any $T \in \mathcal{T}$, we have

$$ 0 \le \|x - T y\|^2 \le 2\|x\|^2 + 2\|T y\|^2 \le 4\alpha^2, $$

since $\|x\| \le 1$, $\|T y\| \le \alpha$, and $\alpha \ge 1$. Thus, the uniform deviation $\Delta_n(X^n)$ has bounded differences with $c_1 = \cdots = c_n = 4\alpha^2/n$, so by McDiarmid's inequality,

$$ L(\hat{T}_n) \le L^*(\mathcal{T}) + 2\, \mathbb{E}\Delta_n(X^n) + 4\alpha^2 \sqrt{\frac{2 \log(1/\delta)}{n}} \tag{8} $$

with probability at least $1 - \delta$. By the usual symmetrization argument, we obtain the bound $\mathbb{E}\Delta_n(X^n) \le 2\, \mathbb{E} R_n(F(X^n))$, where $F$ is the class of functions $f_T$ for all $T \in \mathcal{T}$. Now, the whole affair hinges on getting a good upper bound on the Rademacher averages $R_n(F(X^n))$. We will do this in several steps, and we need to introduce some additional machinery along the way.

2.1 Gaussian averages

Let $\gamma_1, \ldots, \gamma_n$ be i.i.d. standard normal random variables. In analogy to the Rademacher average of a bounded set $A \subset \mathbb{R}^n$, we can define the Gaussian average of $A$ [BM02] as

$$ G_n(A) := \mathbb{E}_\gamma \sup_{a \in A} \Big| \frac{1}{n} \sum_{i=1}^n \gamma_i a_i \Big|. $$

Lemma 1 (Gaussian averages vs. Rademacher averages).

$$ R_n(A) \le \sqrt{\frac{\pi}{2}}\, G_n(A). \tag{9} $$
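The inequality (9) is easy to sanity-check by Monte Carlo: for a small random finite set $A \subset \mathbb{R}^n$, the estimated Rademacher average should not exceed $\sqrt{\pi/2}$ times the estimated Gaussian average. A quick sketch (all sizes are arbitrary toy choices; the slack in the assertion accounts for Monte Carlo error):

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, trials = 6, 5, 20000
A = rng.standard_normal((m, n))     # a small finite set A of m vectors in R^n

sig = rng.choice([-1.0, 1.0], size=(trials, n))   # Rademacher draws
gam = rng.standard_normal((trials, n))            # Gaussian draws

# empirical estimates of R_n(A) and G_n(A)
R = float(np.mean(np.max(np.abs(sig @ A.T), axis=1)) / n)
G = float(np.mean(np.max(np.abs(gam @ A.T), axis=1)) / n)
```

For generic sets the two averages are quite close, so the constant $\sqrt{\pi/2} \approx 1.25$ leaves a comfortable margin.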

Proof. Let $\sigma = (\sigma_1, \ldots, \sigma_n)$ be an $n$-tuple of i.i.d. Rademacher random variables independent of $\gamma$. Since each $\gamma_i$ is a symmetric random variable, it has the same distribution as $\sigma_i |\gamma_i|$. Therefore,

$$ G_n(A) = \frac{1}{n}\, \mathbb{E}_\gamma \sup_{a \in A} \Big| \sum_{i=1}^n \gamma_i a_i \Big| = \frac{1}{n}\, \mathbb{E}_\sigma \mathbb{E}_{|\gamma|} \sup_{a \in A} \Big| \sum_{i=1}^n \sigma_i |\gamma_i| a_i \Big| \ge \frac{1}{n}\, \mathbb{E}_\sigma \sup_{a \in A} \Big| \sum_{i=1}^n \sigma_i\, \mathbb{E}|\gamma_i|\, a_i \Big| = \mathbb{E}|\gamma_1| \cdot R_n(A), $$

where the second-to-last step is by convexity (Jensen's inequality applied to the $|\gamma_i|$'s), while in the last step we have used the fact that $\gamma_1, \ldots, \gamma_n$ are i.i.d. random variables. Now, if $\gamma$ is a standard normal random variable, then

$$ \mathbb{E}|\gamma| = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} |t|\, e^{-t^2/2}\, dt = \frac{2}{\sqrt{2\pi}} \int_0^{\infty} t\, e^{-t^2/2}\, dt = \sqrt{\frac{2}{\pi}}. $$

Rearranging, we get (9). $\square$

Gaussian averages are often easier to work with than Rademacher averages. The reason for this is that, for any real constants $a_1, \ldots, a_n$, the sum $W_a := a_1 \gamma_1 + \cdots + a_n \gamma_n$ is a Gaussian random variable with mean $0$ and variance $a_1^2 + \cdots + a_n^2$. Moreover, for any finite collection of vectors $a^{(1)}, \ldots, a^{(m)} \in A$, the random variables $W_{a^{(1)}}, \ldots, W_{a^{(m)}}$ are jointly Gaussian. Thus, the collection of random variables $(W_a)_{a \in A}$ is a zero-mean Gaussian process, where we say that a collection of real-valued random variables $(W_a)$ is a Gaussian process if all finite linear combinations of the $W_a$'s are Gaussian random variables. In particular, we can compute covariances: for any $a, a' \in A$,

$$ \mathbb{E}[W_a W_{a'}] = \sum_{i=1}^n \sum_{j=1}^n \mathbb{E}[\gamma_i \gamma_j]\, a_i a'_j = \sum_{i=1}^n a_i a'_i = \langle a, a' \rangle $$

and quantities like

$$ \mathbb{E}[(W_a - W_{a'})^2] = \sum_{i=1}^n \sum_{j=1}^n \mathbb{E}[\gamma_i \gamma_j] (a_i - a'_i)(a_j - a'_j) = \sum_{i=1}^n (a_i - a'_i)^2 = \|a - a'\|^2. $$

The latter quantities are handy because of a very useful result called Slepian's lemma [Sle62, LT91]:

Lemma 2. Let $(W_a)_{a \in A}$ and $(V_a)_{a \in A}$ be two zero-mean Gaussian processes with some index set $A$ (not necessarily a subset of $\mathbb{R}^n$), such that

$$ \mathbb{E}[(W_a - W_{a'})^2] \le \mathbb{E}[(V_a - V_{a'})^2], \qquad \forall a, a' \in A. \tag{10} $$

Then

$$ \mathbb{E} \sup_{a \in A} W_a \le \mathbb{E} \sup_{a \in A} V_a. \tag{11} $$

Remark 2. The Gaussian processes $(W_a), (V_a)$ that appear in Slepian's lemma are not necessarily of the form $W_a = \langle a, \gamma \rangle$ with $\gamma = (\gamma_1, \ldots, \gamma_n)$ a vector of independent Gaussians. They can be arbitrary collections of random variables indexed by the elements of $A$, such that any finite linear combination of $W_a$'s or of $V_a$'s is Gaussian.

Slepian's lemma is typically used to obtain upper bounds on the expected supremum of one Gaussian process in terms of another, which is hopefully easier to handle. The only wrinkle is that we cannot apply Slepian's lemma directly to the problem of estimating the Gaussian average $G_n(A)$ because of the absolute value. However, if all $a \in A$ are uniformly bounded in norm, the absolute value makes little difference:

Lemma 3. Let $A \subset \mathbb{R}^n$ be a set of vectors uniformly bounded in norm, i.e., there exists some $L < \infty$ such that $\|a\| \le L$ for all $a \in A$. Let

$$ G'_n(A) := \frac{1}{n}\, \mathbb{E}\Big[ \sup_{a \in A} \sum_{i=1}^n \gamma_i a_i \Big]. \tag{12} $$

Then

$$ G'_n(A) \le G_n(A) \le 2\, G'_n(A) + \sqrt{\frac{2}{\pi}} \cdot \frac{L}{n}. \tag{13} $$

Proof. The first inequality in (13) is obvious. For the second inequality, pick an arbitrary $a_0 \in A$, let $W_a = \sum_{i=1}^n \gamma_i a_i$ for any $a \in A$, and write

$$ G_n(A) = \frac{1}{n}\, \mathbb{E}\Big[ \sup_{a \in A} |W_a| \Big] \le \frac{1}{n}\, \mathbb{E}\Big[ \sup_{a \in A} |W_a - W_{a_0}| \Big] + \frac{1}{n}\, \mathbb{E}|W_{a_0}|. $$

Since $a_0$ was arbitrary, this gives

$$ G_n(A) \le \inf_{a_0 \in A} \Big\{ \frac{1}{n}\, \mathbb{E}\Big[ \sup_{a \in A} |W_a - W_{a_0}| \Big] + \frac{1}{n}\, \mathbb{E}|W_{a_0}| \Big\} \le \frac{1}{n}\, \mathbb{E}\Big[ \sup_{a, a' \in A} |W_a - W_{a'}| \Big] + \frac{1}{n} \sup_{a \in A} \mathbb{E}|W_a|. \tag{14} $$

For any two $a, a'$, the random variable $W_a - W_{a'}$ is symmetric, so

$$ \mathbb{E}\Big[ \sup_{a, a' \in A} |W_a - W_{a'}| \Big] = \mathbb{E}\Big[ \sup_{a, a' \in A} (W_a - W_{a'}) \Big] \le 2\, \mathbb{E}\Big[ \sup_{a \in A} W_a \Big]. $$

Moreover, for any $a \in A$, $W_a$ is Gaussian with zero mean and variance $\|a\|^2 \le L^2$. Thus,

$$ \sup_{a \in A} \mathbb{E}|W_a| \le L\, \mathbb{E}|\gamma| = \sqrt{\frac{2}{\pi}}\, L. $$

Using the two above formulas in (14), we get the second inequality in (13), and the lemma is proved. $\square$

Armed with this lemma, we can work with the quantity $G'_n(A)$ instead of the Gaussian average $G_n(A)$. The advantage is that now we can rely on tools like Slepian's lemma.

2.2 Bounding the Rademacher average

Now everything hinges on bounding the Gaussian average $G'_n(F(x^n))$ for a fixed sample $x^n = (x_1, \ldots, x_n)$, which in turn will give us a bound on the Rademacher average $R_n(F(x^n))$, by Lemmas 1 and 3. Let $(\gamma_i)_{1 \le i \le n}$, $(\gamma_{ij})_{1 \le i \le n,\, 1 \le j \le k}$, and $(\gamma_{ijl})_{1 \le i \le n,\, 1 \le j, l \le k}$ be mutually independent sequences of i.i.d. standard Gaussian random variables. Define the following zero-mean Gaussian processes, indexed by $T \in \mathcal{T}$:

$$ W_T := \sum_{i=1}^n \gamma_i f_T(x_i), \qquad V_T := \sum_{i=1}^n \sum_{j=1}^k \gamma_{ij} \langle x_i, T e_j \rangle, \qquad U_T := \sum_{i=1}^n \sum_{j=1}^k \sum_{l=1}^k \gamma_{ijl} \langle T e_j, T e_l \rangle, $$

$$ \Upsilon_T := \sqrt{8}\, V_T + \sqrt{2}\, U_T. $$

By definition,

$$ G'_n(F(x^n)) = \mathbb{E} \sup_{T \in \mathcal{T}} \frac{1}{n} \sum_{i=1}^n \gamma_i f_T(x_i) = \frac{1}{n}\, \mathbb{E} \sup_{T \in \mathcal{T}} W_T, $$

and we define $G_n(F(x^n))$ similarly. We will use Slepian's lemma to upper-bound $G'_n(F(x^n))$ in terms of the expected suprema of $(V_T)$ and $(U_T)$. To that end, we start with

$$ \mathbb{E}[(W_T - W_{T'})^2] = \sum_{i=1}^n \big( f_T(x_i) - f_{T'}(x_i) \big)^2 = \sum_{i=1}^n \Big( \min_{y \in Y} \|x_i - T y\|^2 - \min_{y \in Y} \|x_i - T' y\|^2 \Big)^2 $$
$$ \le \sum_{i=1}^n \Big( \max_{y \in Y} \big| \|x_i - T y\|^2 - \|x_i - T' y\|^2 \big| \Big)^2 = \sum_{i=1}^n \Big( \max_{y \in Y} \big| 2 \langle x_i, T' y - T y \rangle + \|T y\|^2 - \|T' y\|^2 \big| \Big)^2 $$
$$ \le 8 \sum_{i=1}^n \Big( \max_{y \in Y} |\langle x_i, (T - T') y \rangle| \Big)^2 + 2 \sum_{i=1}^n \Big( \max_{y \in Y} \big| \|T y\|^2 - \|T' y\|^2 \big| \Big)^2, \tag{15} $$

where in the third line we have used the properties of inner products, and the last line is by the inequality $(a + b)^2 \le 2a^2 + 2b^2$. Now, for each $i$,

$$ \max_{y \in Y} |\langle x_i, (T - T') y \rangle| = \max_{y \in Y} \Big| \sum_{j=1}^k y_j \langle x_i, (T - T') e_j \rangle \Big| \le \max_{y \in Y} \|y\| \Big( \sum_{j=1}^k \langle x_i, T e_j - T' e_j \rangle^2 \Big)^{1/2} \le \Big( \sum_{j=1}^k \langle x_i, T e_j - T' e_j \rangle^2 \Big)^{1/2}, $$

where in the second step we have used Cauchy–Schwarz, and in the third the fact that $Y \subseteq B^k$. Summing over $1 \le i \le n$, we see that

$$ \sum_{i=1}^n \Big( \max_{y \in Y} |\langle x_i, (T - T') y \rangle| \Big)^2 \le \sum_{i=1}^n \sum_{j=1}^k \langle x_i, T e_j - T' e_j \rangle^2 = \mathbb{E}[(V_T - V_{T'})^2]. \tag{16} $$

Similarly,

$$ \Big( \max_{y \in Y} \big| \|T y\|^2 - \|T' y\|^2 \big| \Big)^2 = \Big( \max_{y \in Y} \Big| \sum_{j=1}^k \sum_{l=1}^k y_j y_l \big( \langle T e_j, T e_l \rangle - \langle T' e_j, T' e_l \rangle \big) \Big| \Big)^2 $$
$$ \le \max_{y \in Y} \|y\|^4 \sum_{j=1}^k \sum_{l=1}^k \big( \langle T e_j, T e_l \rangle - \langle T' e_j, T' e_l \rangle \big)^2 \le \sum_{j=1}^k \sum_{l=1}^k \big( \langle T e_j, T e_l \rangle - \langle T' e_j, T' e_l \rangle \big)^2, $$

again by Cauchy–Schwarz (note that $\sum_{j,l} (y_j y_l)^2 = \|y\|^4 \le 1$). Summing over $1 \le i \le n$,

$$ \sum_{i=1}^n \Big( \max_{y \in Y} \big| \|T y\|^2 - \|T' y\|^2 \big| \Big)^2 \le \sum_{i=1}^n \sum_{j=1}^k \sum_{l=1}^k \big( \langle T e_j, T e_l \rangle - \langle T' e_j, T' e_l \rangle \big)^2 = \mathbb{E}[(U_T - U_{T'})^2]. \tag{17} $$
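Before taking expectations, the chain (15)-(17) amounts to a deterministic inequality relating the reconstruction errors of any two decoders $T, T'$ to the quantities $\sum_{i,j} \langle x_i, (T - T') e_j \rangle^2$ and $n \sum_{j,l} (\langle T e_j, T e_l \rangle - \langle T' e_j, T' e_l \rangle)^2$, for any codebook $Y \subseteq B^k$. Here is a quick numerical check with $H = \mathbb{R}^d$, two random decoders, and a finite random codebook (all dimensions hypothetical):

```python
import numpy as np

rng = np.random.default_rng(5)
d, k, n, m = 7, 3, 40, 200

X = rng.standard_normal((n, d))
X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)   # x_i in B(H)

# finite codebook inside the unit ball B^k
Y = rng.standard_normal((m, k))
Y /= np.maximum(np.linalg.norm(Y, axis=1, keepdims=True), 1.0)

T1, T2 = rng.standard_normal((d, k)), rng.standard_normal((d, k))

def f(x, T):
    # f_T(x) = min_{y in Y} ||x - T y||^2
    return float(np.min(np.linalg.norm(x[None, :] - Y @ T.T, axis=1) ** 2))

lhs = sum((f(x, T1) - f(x, T2)) ** 2 for x in X)

D = T1 - T2
term_V = 8 * float(np.sum((X @ D) ** 2))        # 8 * sum_{i,j} <x_i, (T-T')e_j>^2
G1, G2 = T1.T @ T1, T2.T @ T2                   # Gram matrices of the columns
term_U = 2 * n * float(np.sum((G1 - G2) ** 2))  # 2 * sum_i sum_{j,l} (...)^2
```

The assertion `lhs <= term_V + term_U` is exactly the comparison that lets Slepian's lemma replace the intractable process $(W_T)$ by the simpler processes $(V_T)$ and $(U_T)$.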

Using (16) and (17) in (15), we have

$$ \mathbb{E}[(W_T - W_{T'})^2] \le 8\, \mathbb{E}[(V_T - V_{T'})^2] + 2\, \mathbb{E}[(U_T - U_{T'})^2] = \mathbb{E}[(\Upsilon_T - \Upsilon_{T'})^2]. $$

We can therefore apply Slepian's lemma (Lemma 2) to $(W_T)$ and $(\Upsilon_T)$ to write

$$ G'_n(F(x^n)) = \frac{1}{n}\, \mathbb{E} \sup_{T \in \mathcal{T}} W_T \le \frac{1}{n}\, \mathbb{E} \sup_{T \in \mathcal{T}} \Upsilon_T \le \frac{\sqrt{8}}{n}\, \mathbb{E} \sup_{T \in \mathcal{T}} V_T + \frac{\sqrt{2}}{n}\, \mathbb{E} \sup_{T \in \mathcal{T}} U_T. \tag{18} $$

We now upper-bound the expected suprema of $V_T$ and $U_T$. For the former,

$$ \mathbb{E} \sup_T V_T = \mathbb{E} \sup_T \sum_{j=1}^k \Big\langle \sum_{i=1}^n \gamma_{ij} x_i,\, T e_j \Big\rangle \qquad \text{(linearity)} $$
$$ \le \mathbb{E} \sup_T \sum_{j=1}^k \Big\| \sum_{i=1}^n \gamma_{ij} x_i \Big\| \, \|T e_j\| \qquad \text{(Cauchy–Schwarz)} $$
$$ \le \alpha\, \mathbb{E} \sum_{j=1}^k \Big\| \sum_{i=1}^n \gamma_{ij} x_i \Big\| \qquad \text{(assumption on } \mathcal{T}) $$
$$ \le \alpha \sum_{j=1}^k \Big( \mathbb{E} \Big\| \sum_{i=1}^n \gamma_{ij} x_i \Big\|^2 \Big)^{1/2} \qquad \text{(Jensen)} $$
$$ = \alpha \sum_{j=1}^k \Big( \sum_{i=1}^n \sum_{i'=1}^n \mathbb{E}[\gamma_{ij} \gamma_{i'j}] \langle x_i, x_{i'} \rangle \Big)^{1/2} = \alpha \sum_{j=1}^k \Big( \sum_{i=1}^n \|x_i\|^2 \Big)^{1/2} \qquad \text{(properties of i.i.d. Gaussians)} $$
$$ \le \alpha k \sqrt{n}. \qquad (x_i \in B(H) \text{ for all } i) $$

Similarly, for the latter,

$$ \mathbb{E} \sup_T U_T = \mathbb{E} \sup_T \sum_{j=1}^k \sum_{l=1}^k \Big( \sum_{i=1}^n \gamma_{ijl} \Big) \langle T e_j, T e_l \rangle \le \sum_{j=1}^k \sum_{l=1}^k \mathbb{E}\Big| \sum_{i=1}^n \gamma_{ijl} \Big| \, \sup_T \|T e_j\| \, \|T e_l\| \le \sqrt{\frac{2n}{\pi}}\, \alpha^2 k^2, $$

since $\sum_{i=1}^n \gamma_{ijl} \sim N(0, n)$, so that $\mathbb{E}\big| \sum_i \gamma_{ijl} \big| = \sqrt{2n/\pi}$. Substituting these bounds into (18), we have

$$ G'_n(F(x^n)) \le \frac{1}{n} \Big( \sqrt{8}\, \alpha k \sqrt{n} + \sqrt{2} \cdot \sqrt{\frac{2n}{\pi}}\, \alpha^2 k^2 \Big) = \Big( 2\sqrt{2}\, \alpha k + \frac{2}{\sqrt{\pi}}\, \alpha^2 k^2 \Big) \frac{1}{\sqrt{n}} \le \frac{5 \alpha^2 k^2}{\sqrt{n}}. $$

Thus, applying Lemmas 1 and 3 (with $L = \sup_T \|(f_T(x_1), \ldots, f_T(x_n))\| \le 4\alpha^2 \sqrt{n}$, since $\max_i f_T(x_i) \le 4\alpha^2$), we have

$$ \sqrt{\frac{\pi}{2}}\, R_n(F(x^n)) \le G_n(F(x^n)) \le 2\, G'_n(F(x^n)) + \sqrt{\frac{2}{\pi}} \cdot \frac{4\alpha^2}{\sqrt{n}} \le \frac{10 \alpha^2 k^2}{\sqrt{n}} + \sqrt{\frac{2}{\pi}} \cdot \frac{4\alpha^2}{\sqrt{n}}, $$

so that

$$ R_n(F(x^n)) \le \sqrt{\frac{2}{\pi}} \Big[ \frac{10 \alpha^2 k^2}{\sqrt{n}} + \sqrt{\frac{2}{\pi}} \cdot \frac{4\alpha^2}{\sqrt{n}} \Big] \le \frac{15 \alpha^2 k^2}{\sqrt{n}}. $$

Recalling (8), together with $\mathbb{E}\Delta_n(X^n) \le 2\, \mathbb{E} R_n(F(X^n))$, we see that the event (2) holds with probability at least $1 - \delta$.

For the special case of $k$-means clustering, i.e., when $Y = \{e_1, \ldots, e_k\}$, we follow a slightly different strategy. Define a zero-mean Gaussian process

$$ \Xi_T := \sum_{i=1}^n \sum_{j=1}^k \gamma_{ij} \|x_i - T e_j\|^2, \qquad T \in \mathcal{T}. $$

Then

$$ \mathbb{E}[(W_T - W_{T'})^2] = \sum_{i=1}^n \Big( \min_{1 \le j \le k} \|x_i - T e_j\|^2 - \min_{1 \le j \le k} \|x_i - T' e_j\|^2 \Big)^2 \le \sum_{i=1}^n \Big( \max_{1 \le j \le k} \big| \|x_i - T e_j\|^2 - \|x_i - T' e_j\|^2 \big| \Big)^2 $$
$$ \le \sum_{i=1}^n \sum_{j=1}^k \big( \|x_i - T e_j\|^2 - \|x_i - T' e_j\|^2 \big)^2 = \mathbb{E}[(\Xi_T - \Xi_{T'})^2]. $$

For the process $(\Xi_T)$, we have

$$ \mathbb{E} \sup_T \Xi_T = \mathbb{E} \sup_T \sum_{i=1}^n \sum_{j=1}^k \gamma_{ij} \big( \|x_i\|^2 - 2 \langle x_i, T e_j \rangle + \|T e_j\|^2 \big) = \mathbb{E} \sup_T \Big\{ -2 \sum_{i,j} \gamma_{ij} \langle x_i, T e_j \rangle + \sum_{i,j} \gamma_{ij} \|T e_j\|^2 \Big\} $$
$$ \le 2\, \mathbb{E} \sup_T \sum_{i,j} \gamma_{ij} \langle x_i, T e_j \rangle + \mathbb{E} \sup_T \sum_{j=1}^k \Big( \sum_{i=1}^n \gamma_{ij} \Big) \|T e_j\|^2 \le 2 \alpha k \sqrt{n} + \sqrt{\frac{2n}{\pi}}\, \alpha^2 k \le 3 k \alpha^2 \sqrt{n}, $$

where the first equality uses the fact that the term $\sum_{i,j} \gamma_{ij} \|x_i\|^2$ has zero mean and does not depend on $T$, and the remaining bounds are obtained by the same methods we used for $(V_T)$ and $(U_T)$. By Slepian's lemma, $G'_n(F(x^n)) \le \frac{1}{n}\, \mathbb{E} \sup_T \Xi_T \le 3 \alpha^2 k / \sqrt{n}$. Using Lemmas 1–3, we have

$$ \sqrt{\frac{\pi}{2}}\, R_n(F(x^n)) \le G_n(F(x^n)) \le 2\, G'_n(F(x^n)) + \sqrt{\frac{2}{\pi}} \cdot \frac{4\alpha^2}{\sqrt{n}} \le \frac{6 \alpha^2 k}{\sqrt{n}} + \sqrt{\frac{2}{\pi}} \cdot \frac{4\alpha^2}{\sqrt{n}}, $$

so that

$$ R_n(F(x^n)) \le \sqrt{\frac{2}{\pi}} \Big[ \frac{6 \alpha^2 k}{\sqrt{n}} + \sqrt{\frac{2}{\pi}} \cdot \frac{4\alpha^2}{\sqrt{n}} \Big] \le \frac{10 \alpha^2 k}{\sqrt{n}}. $$

Again, recalling (8), we see that the event (3) occurs with probability at least $1 - \delta$. The proof of Theorem 1 is complete. $\square$

A Linear operators between Hilbert spaces

We assume, for simplicity, that all Hilbert spaces $H$ of interest are separable. By definition, a Hilbert space $H$ is separable if it has a countable dense subset: there exists a countable set $\{h_1, h_2, \ldots\} \subset H$, such that for any $h \in H$ and any $\varepsilon > 0$ there exists some $j \in \mathbb{N}$, for which $\|h - h_j\|_H < \varepsilon$. Any separable Hilbert space $H$ has a countable complete and orthonormal basis, i.e., a countable set $\{\varphi_1, \varphi_2, \ldots\} \subset H$ with the following properties:

1. Orthonormality: $\langle \varphi_i, \varphi_j \rangle_H = \delta_{ij}$;
2. Completeness: if there exists some $h \in H$ which is orthogonal to all $\varphi_j$'s, i.e., $\langle h, \varphi_j \rangle = 0$ for all $j$, then $h = 0$.

As a consequence, any $h \in H$ can be uniquely represented as an infinite linear combination

$$ h = \sum_{j=1}^\infty c_j \varphi_j, \qquad \text{where } c_j = \langle h, \varphi_j \rangle_H, $$

where the infinite series converges in norm, i.e., for any $\varepsilon > 0$ there exists some $N$, such that

$$ \Big\| h - \sum_{j=1}^n c_j \varphi_j \Big\|_H < \varepsilon \qquad \text{for all } n \ge N. $$

Moreover, $\|h\|_H^2 = \sum_{j=1}^\infty c_j^2$.

Let $H$ and $K$ be two Hilbert spaces. A linear operator from $H$ into $K$ is a mapping $T : H \to K$, such that $T(\alpha h + \alpha' h') = \alpha T h + \alpha' T h'$ for any two $h, h' \in H$ and $\alpha, \alpha' \in \mathbb{R}$. A linear operator $T : H \to K$ is bounded if

$$ \|T\|_{H \to K} := \sup_{h \in H,\, h \neq 0} \frac{\|T h\|_K}{\|h\|_H} < \infty. $$

We will denote the space of all bounded linear operators $T : H \to K$ by $L(H, K)$. When $H = K$, we will write $L(H)$ instead. For any operator $T \in L(H, K)$, we have the adjoint operator $T^* \in L(K, H)$, which is characterized by

$$ \langle g, T h \rangle_K = \langle T^* g, h \rangle_H, \qquad \forall g \in K,\ h \in H. $$

If $T \in L(H)$ has the property that $T^* = T$, we say that $T$ is self-adjoint. Some examples:

- The identity operator on $H$, denoted by $I_H$, maps each $h \in H$ to itself. $I_H$ is a self-adjoint operator with $\|I_H\|_{H \to H} = 1$. We will often omit the index $H$ and just write $I$.
- A projection is an operator $\Pi \in L(H)$ satisfying $\Pi^2 = \Pi$, i.e., $\Pi(\Pi h) = \Pi h$ for any $h \in H$. This is a bounded operator with $\|\Pi\| \le 1$. Any projection is self-adjoint. (Here, "projection" always means orthogonal projection.)
- An isometry is an operator $T \in L(H, K)$, such that $\|T h\|_K = \|h\|_H$ for all $h \in H$, i.e., $T$ preserves norms. If $T$ is an isometry, then $T^* T = I_H$, while $T T^* \in L(K)$ is a projection. This is easy to see: $(T T^*)(T T^*) = T (T^* T) T^* = T T^*$.
- If $T \in L(H)$ and $T^* \in L(H)$ are both isometries, then $T$ is called a unitary operator.

If $\Pi \in L(H)$ is a projection whose range $K \subseteq H$ is a closed $k$-dimensional subspace, then there exists an isometry $T \in L(\mathbb{R}^k, K)$, such that $\Pi = T T^*$. Here, $\mathbb{R}^k$ is a Hilbert space with the usual norm. To see this, let $\{\psi_1, \ldots, \psi_k\} \subset H$ be an orthonormal basis of $K$, and complete it to a countable basis $\{\psi_1, \ldots, \psi_k, \psi_{k+1}, \psi_{k+2}, \ldots\}$ for the entire $H$. Here, the elements of $\{\psi_j\}_{j \ge k+1}$ are mutually orthonormal and orthogonal to $\{\psi_j\}_{j=1}^k$. Any $h \in H$ has a unique representation $h = \sum_j \alpha_j \psi_j$ for some real coefficients $\alpha_1, \alpha_2, \ldots$. With this, we can write out the action of $\Pi$ explicitly as

$$ \Pi h = \sum_{j=1}^k \alpha_j \psi_j. $$

Now consider the map $T : \mathbb{R}^k \to K$ that takes $\alpha = (\alpha_1, \ldots, \alpha_k) \in \mathbb{R}^k$ to $\sum_{j=1}^k \alpha_j \psi_j$. It is easy to see that $T$ is an isometry. Indeed,

$$ \|T \alpha\|_H^2 = \Big\| \sum_{j=1}^k \alpha_j \psi_j \Big\|_H^2 = \sum_{j=1}^k \alpha_j^2 = \|\alpha\|^2. $$

The adjoint of $T$ is easily computed: for any $\alpha = (\alpha_1, \ldots, \alpha_k) \in \mathbb{R}^k$ and any $h = \sum_j \beta_j \psi_j \in H$,

$$ \langle h, T \alpha \rangle_H = \langle \Pi h, T \alpha \rangle_H = \Big\langle \sum_{j=1}^k \beta_j \psi_j,\, \sum_{j=1}^k \alpha_j \psi_j \Big\rangle_H = \sum_{j=1}^k \beta_j \alpha_j = \langle T^* h, \alpha \rangle. $$

Since this must hold for arbitrary $\alpha \in \mathbb{R}^k$ and $h \in H$, we must have

$$ T^* h = T^* \Big( \sum_j \beta_j \psi_j \Big) = (\beta_1, \ldots, \beta_k). $$

Now let's compute $T T^* h$ for any $h = \sum_j \beta_j \psi_j$:

$$ T T^* h = T(T^* h) = T\big( (\beta_1, \ldots, \beta_k) \big) = \sum_{j=1}^k \beta_j \psi_j = \Pi h. $$

References

[BM02] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.

[LS99] D. D. Lee and H. S. Seung. Learning the parts of objects by nonnegative matrix factorization. Nature, 401:788–791, 1999.

[LT91] M. Ledoux and M. Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Springer, 1991.

[MP10] A. Maurer and M. Pontil. K-dimensional coding schemes in Hilbert spaces. IEEE Transactions on Information Theory, 56(11):5839–5846, November 2010.

[Sle62] D. Slepian. The one-sided barrier problem for Gaussian noise. Bell System Technical Journal, 41:463–501, 1962.


ACO Comprehensive Exam 9 October 2007 Student code A. 1. Graph Theory 1. Graph Theory Prove that there exist o simple plaar triagulatio T ad two distict adjacet vertices x, y V (T ) such that x ad y are the oly vertices of T of odd degree. Do ot use the Four-Color Theorem.

More information

Lecture 3 : Random variables and their distributions

Lecture 3 : Random variables and their distributions Lecture 3 : Radom variables ad their distributios 3.1 Radom variables Let (Ω, F) ad (S, S) be two measurable spaces. A map X : Ω S is measurable or a radom variable (deoted r.v.) if X 1 (A) {ω : X(ω) A}

More information

6.3 Testing Series With Positive Terms

6.3 Testing Series With Positive Terms 6.3. TESTING SERIES WITH POSITIVE TERMS 307 6.3 Testig Series With Positive Terms 6.3. Review of what is kow up to ow I theory, testig a series a i for covergece amouts to fidig the i= sequece of partial

More information

b i u x i U a i j u x i u x j

b i u x i U a i j u x i u x j M ath 5 2 7 Fall 2 0 0 9 L ecture 1 9 N ov. 1 6, 2 0 0 9 ) S ecod- Order Elliptic Equatios: Weak S olutios 1. Defiitios. I this ad the followig two lectures we will study the boudary value problem Here

More information

5 Birkhoff s Ergodic Theorem

5 Birkhoff s Ergodic Theorem 5 Birkhoff s Ergodic Theorem Amog the most useful of the various geeralizatios of KolmogorovâĂŹs strog law of large umbers are the ergodic theorems of Birkhoff ad Kigma, which exted the validity of the

More information

5.1 Review of Singular Value Decomposition (SVD)

5.1 Review of Singular Value Decomposition (SVD) MGMT 69000: Topics i High-dimesioal Data Aalysis Falll 06 Lecture 5: Spectral Clusterig: Overview (cotd) ad Aalysis Lecturer: Jiamig Xu Scribe: Adarsh Barik, Taotao He, September 3, 06 Outlie Review of

More information

1 Review and Overview

1 Review and Overview DRAFT a fial versio will be posted shortly CS229T/STATS231: Statistical Learig Theory Lecturer: Tegyu Ma Lecture #3 Scribe: Migda Qiao October 1, 2013 1 Review ad Overview I the first half of this course,

More information

Sequences and Series of Functions

Sequences and Series of Functions Chapter 6 Sequeces ad Series of Fuctios 6.1. Covergece of a Sequece of Fuctios Poitwise Covergece. Defiitio 6.1. Let, for each N, fuctio f : A R be defied. If, for each x A, the sequece (f (x)) coverges

More information

Notes for Lecture 11

Notes for Lecture 11 U.C. Berkeley CS78: Computatioal Complexity Hadout N Professor Luca Trevisa 3/4/008 Notes for Lecture Eigevalues, Expasio, ad Radom Walks As usual by ow, let G = (V, E) be a udirected d-regular graph with

More information

Algebra of Least Squares

Algebra of Least Squares October 19, 2018 Algebra of Least Squares Geometry of Least Squares Recall that out data is like a table [Y X] where Y collects observatios o the depedet variable Y ad X collects observatios o the k-dimesioal

More information

Chapter 6 Principles of Data Reduction

Chapter 6 Principles of Data Reduction Chapter 6 for BST 695: Special Topics i Statistical Theory. Kui Zhag, 0 Chapter 6 Priciples of Data Reductio Sectio 6. Itroductio Goal: To summarize or reduce the data X, X,, X to get iformatio about a

More information

7.1 Convergence of sequences of random variables

7.1 Convergence of sequences of random variables Chapter 7 Limit Theorems Throughout this sectio we will assume a probability space (, F, P), i which is defied a ifiite sequece of radom variables (X ) ad a radom variable X. The fact that for every ifiite

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 3 9/11/2013. Large deviations Theory. Cramér s Theorem

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 3 9/11/2013. Large deviations Theory. Cramér s Theorem MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/5.070J Fall 203 Lecture 3 9//203 Large deviatios Theory. Cramér s Theorem Cotet.. Cramér s Theorem. 2. Rate fuctio ad properties. 3. Chage of measure techique.

More information

Machine Learning Brett Bernstein

Machine Learning Brett Bernstein Machie Learig Brett Berstei Week 2 Lecture: Cocept Check Exercises Starred problems are optioal. Excess Risk Decompositio 1. Let X = Y = {1, 2,..., 10}, A = {1,..., 10, 11} ad suppose the data distributio

More information

Singular Continuous Measures by Michael Pejic 5/14/10

Singular Continuous Measures by Michael Pejic 5/14/10 Sigular Cotiuous Measures by Michael Peic 5/4/0 Prelimiaries Give a set X, a σ-algebra o X is a collectio of subsets of X that cotais X ad ad is closed uder complemetatio ad coutable uios hece, coutable

More information

1 Duality revisited. AM 221: Advanced Optimization Spring 2016

1 Duality revisited. AM 221: Advanced Optimization Spring 2016 AM 22: Advaced Optimizatio Sprig 206 Prof. Yaro Siger Sectio 7 Wedesday, Mar. 9th Duality revisited I this sectio, we will give a slightly differet perspective o duality. optimizatio program: f(x) x R

More information

Discrete-Time Systems, LTI Systems, and Discrete-Time Convolution

Discrete-Time Systems, LTI Systems, and Discrete-Time Convolution EEL5: Discrete-Time Sigals ad Systems. Itroductio I this set of otes, we begi our mathematical treatmet of discrete-time s. As show i Figure, a discrete-time operates or trasforms some iput sequece x [

More information

Notes 27 : Brownian motion: path properties

Notes 27 : Brownian motion: path properties Notes 27 : Browia motio: path properties Math 733-734: Theory of Probability Lecturer: Sebastie Roch Refereces:[Dur10, Sectio 8.1], [MP10, Sectio 1.1, 1.2, 1.3]. Recall: DEF 27.1 (Covariace) Let X = (X

More information

TENSOR PRODUCTS AND PARTIAL TRACES

TENSOR PRODUCTS AND PARTIAL TRACES Lecture 2 TENSOR PRODUCTS AND PARTIAL TRACES Stéphae ATTAL Abstract This lecture cocers special aspects of Operator Theory which are of much use i Quatum Mechaics, i particular i the theory of Quatum Ope

More information

Geometry of LS. LECTURE 3 GEOMETRY OF LS, PROPERTIES OF σ 2, PARTITIONED REGRESSION, GOODNESS OF FIT

Geometry of LS. LECTURE 3 GEOMETRY OF LS, PROPERTIES OF σ 2, PARTITIONED REGRESSION, GOODNESS OF FIT OCTOBER 7, 2016 LECTURE 3 GEOMETRY OF LS, PROPERTIES OF σ 2, PARTITIONED REGRESSION, GOODNESS OF FIT Geometry of LS We ca thik of y ad the colums of X as members of the -dimesioal Euclidea space R Oe ca

More information

Advanced Stochastic Processes.

Advanced Stochastic Processes. Advaced Stochastic Processes. David Gamarik LECTURE 2 Radom variables ad measurable fuctios. Strog Law of Large Numbers (SLLN). Scary stuff cotiued... Outlie of Lecture Radom variables ad measurable fuctios.

More information

Solutions to home assignments (sketches)

Solutions to home assignments (sketches) Matematiska Istitutioe Peter Kumli 26th May 2004 TMA401 Fuctioal Aalysis MAN670 Applied Fuctioal Aalysis 4th quarter 2003/2004 All documet cocerig the course ca be foud o the course home page: http://www.math.chalmers.se/math/grudutb/cth/tma401/

More information

Mathematical Foundations -1- Sets and Sequences. Sets and Sequences

Mathematical Foundations -1- Sets and Sequences. Sets and Sequences Mathematical Foudatios -1- Sets ad Sequeces Sets ad Sequeces Methods of proof 2 Sets ad vectors 13 Plaes ad hyperplaes 18 Liearly idepedet vectors, vector spaces 2 Covex combiatios of vectors 21 eighborhoods,

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 21 11/27/2013

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 21 11/27/2013 MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 21 11/27/2013 Fuctioal Law of Large Numbers. Costructio of the Wieer Measure Cotet. 1. Additioal techical results o weak covergece

More information

A REMARK ON A PROBLEM OF KLEE

A REMARK ON A PROBLEM OF KLEE C O L L O Q U I U M M A T H E M A T I C U M VOL. 71 1996 NO. 1 A REMARK ON A PROBLEM OF KLEE BY N. J. K A L T O N (COLUMBIA, MISSOURI) AND N. T. P E C K (URBANA, ILLINOIS) This paper treats a property

More information

Support vector machine revisited

Support vector machine revisited 6.867 Machie learig, lecture 8 (Jaakkola) 1 Lecture topics: Support vector machie ad kerels Kerel optimizatio, selectio Support vector machie revisited Our task here is to first tur the support vector

More information

Linear Elliptic PDE s Elliptic partial differential equations frequently arise out of conservation statements of the form

Linear Elliptic PDE s Elliptic partial differential equations frequently arise out of conservation statements of the form Liear Elliptic PDE s Elliptic partial differetial equatios frequetly arise out of coservatio statemets of the form B F d B Sdx B cotaied i bouded ope set U R. Here F, S deote respectively, the flux desity

More information

Introduction to Optimization Techniques. How to Solve Equations

Introduction to Optimization Techniques. How to Solve Equations Itroductio to Optimizatio Techiques How to Solve Equatios Iterative Methods of Optimizatio Iterative methods of optimizatio Solutio of the oliear equatios resultig form a optimizatio problem is usually

More information

LECTURE 8: ORTHOGONALITY (CHAPTER 5 IN THE BOOK)

LECTURE 8: ORTHOGONALITY (CHAPTER 5 IN THE BOOK) LECTURE 8: ORTHOGONALITY (CHAPTER 5 IN THE BOOK) Everythig marked by is ot required by the course syllabus I this lecture, all vector spaces is over the real umber R. All vectors i R is viewed as a colum

More information

Ma 4121: Introduction to Lebesgue Integration Solutions to Homework Assignment 5

Ma 4121: Introduction to Lebesgue Integration Solutions to Homework Assignment 5 Ma 42: Itroductio to Lebesgue Itegratio Solutios to Homework Assigmet 5 Prof. Wickerhauser Due Thursday, April th, 23 Please retur your solutios to the istructor by the ed of class o the due date. You

More information

18.657: Mathematics of Machine Learning

18.657: Mathematics of Machine Learning 8.657: Mathematics of Machie Learig Lecturer: Philippe Rigollet Lecture 4 Scribe: Cheg Mao Sep., 05 I this lecture, we cotiue to discuss the effect of oise o the rate of the excess risk E(h) = R(h) R(h

More information

Lecture 15: Learning Theory: Concentration Inequalities

Lecture 15: Learning Theory: Concentration Inequalities STAT 425: Itroductio to Noparametric Statistics Witer 208 Lecture 5: Learig Theory: Cocetratio Iequalities Istructor: Ye-Chi Che 5. Itroductio Recall that i the lecture o classificatio, we have see that

More information

6.867 Machine learning, lecture 7 (Jaakkola) 1

6.867 Machine learning, lecture 7 (Jaakkola) 1 6.867 Machie learig, lecture 7 (Jaakkola) 1 Lecture topics: Kerel form of liear regressio Kerels, examples, costructio, properties Liear regressio ad kerels Cosider a slightly simpler model where we omit

More information

Math 155 (Lecture 3)

Math 155 (Lecture 3) Math 55 (Lecture 3) September 8, I this lecture, we ll cosider the aswer to oe of the most basic coutig problems i combiatorics Questio How may ways are there to choose a -elemet subset of the set {,,,

More information

Review Problems 1. ICME and MS&E Refresher Course September 19, 2011 B = C = AB = A = A 2 = A 3... C 2 = C 3 = =

Review Problems 1. ICME and MS&E Refresher Course September 19, 2011 B = C = AB = A = A 2 = A 3... C 2 = C 3 = = Review Problems ICME ad MS&E Refresher Course September 9, 0 Warm-up problems. For the followig matrices A = 0 B = C = AB = 0 fid all powers A,A 3,(which is A times A),... ad B,B 3,... ad C,C 3,... Solutio:

More information

If a subset E of R contains no open interval, is it of zero measure? For instance, is the set of irrationals in [0, 1] is of measure zero?

If a subset E of R contains no open interval, is it of zero measure? For instance, is the set of irrationals in [0, 1] is of measure zero? 2 Lebesgue Measure I Chapter 1 we defied the cocept of a set of measure zero, ad we have observed that every coutable set is of measure zero. Here are some atural questios: If a subset E of R cotais a

More information

Element sampling: Part 2

Element sampling: Part 2 Chapter 4 Elemet samplig: Part 2 4.1 Itroductio We ow cosider uequal probability samplig desigs which is very popular i practice. I the uequal probability samplig, we ca improve the efficiecy of the resultig

More information

Rademacher Complexity

Rademacher Complexity EECS 598: Statistical Learig Theory, Witer 204 Topic 0 Rademacher Complexity Lecturer: Clayto Scott Scribe: Ya Deg, Kevi Moo Disclaimer: These otes have ot bee subjected to the usual scrutiy reserved for

More information

Intro to Learning Theory

Intro to Learning Theory Lecture 1, October 18, 2016 Itro to Learig Theory Ruth Urer 1 Machie Learig ad Learig Theory Comig soo 2 Formal Framework 21 Basic otios I our formal model for machie learig, the istaces to be classified

More information

MAT1026 Calculus II Basic Convergence Tests for Series

MAT1026 Calculus II Basic Convergence Tests for Series MAT026 Calculus II Basic Covergece Tests for Series Egi MERMUT 202.03.08 Dokuz Eylül Uiversity Faculty of Sciece Departmet of Mathematics İzmir/TURKEY Cotets Mootoe Covergece Theorem 2 2 Series of Real

More information

Lecture 10 October Minimaxity and least favorable prior sequences

Lecture 10 October Minimaxity and least favorable prior sequences STATS 300A: Theory of Statistics Fall 205 Lecture 0 October 22 Lecturer: Lester Mackey Scribe: Brya He, Rahul Makhijai Warig: These otes may cotai factual ad/or typographic errors. 0. Miimaxity ad least

More information

Product measures, Tonelli s and Fubini s theorems For use in MAT3400/4400, autumn 2014 Nadia S. Larsen. Version of 13 October 2014.

Product measures, Tonelli s and Fubini s theorems For use in MAT3400/4400, autumn 2014 Nadia S. Larsen. Version of 13 October 2014. Product measures, Toelli s ad Fubii s theorems For use i MAT3400/4400, autum 2014 Nadia S. Larse Versio of 13 October 2014. 1. Costructio of the product measure The purpose of these otes is to preset the

More information

Math 525: Lecture 5. January 18, 2018

Math 525: Lecture 5. January 18, 2018 Math 525: Lecture 5 Jauary 18, 2018 1 Series (review) Defiitio 1.1. A sequece (a ) R coverges to a poit L R (writte a L or lim a = L) if for each ǫ > 0, we ca fid N such that a L < ǫ for all N. If the

More information

Discrete Mathematics for CS Spring 2008 David Wagner Note 22

Discrete Mathematics for CS Spring 2008 David Wagner Note 22 CS 70 Discrete Mathematics for CS Sprig 2008 David Wager Note 22 I.I.D. Radom Variables Estimatig the bias of a coi Questio: We wat to estimate the proportio p of Democrats i the US populatio, by takig

More information

2 Banach spaces and Hilbert spaces

2 Banach spaces and Hilbert spaces 2 Baach spaces ad Hilbert spaces Tryig to do aalysis i the ratioal umbers is difficult for example cosider the set {x Q : x 2 2}. This set is o-empty ad bouded above but does ot have a least upper boud

More information

Dimension-free PAC-Bayesian bounds for the estimation of the mean of a random vector

Dimension-free PAC-Bayesian bounds for the estimation of the mean of a random vector Dimesio-free PAC-Bayesia bouds for the estimatio of the mea of a radom vector Olivier Catoi CREST CNRS UMR 9194 Uiversité Paris Saclay olivier.catoi@esae.fr Ilaria Giulii Laboratoire de Probabilités et

More information

Axioms of Measure Theory

Axioms of Measure Theory MATH 532 Axioms of Measure Theory Dr. Neal, WKU I. The Space Throughout the course, we shall let X deote a geeric o-empty set. I geeral, we shall ot assume that ay algebraic structure exists o X so that

More information

Bertrand s Postulate

Bertrand s Postulate Bertrad s Postulate Lola Thompso Ross Program July 3, 2009 Lola Thompso (Ross Program Bertrad s Postulate July 3, 2009 1 / 33 Bertrad s Postulate I ve said it oce ad I ll say it agai: There s always a

More information

Homework 2. Show that if h is a bounded sesquilinear form on the Hilbert spaces X and Y, then h has the representation

Homework 2. Show that if h is a bounded sesquilinear form on the Hilbert spaces X and Y, then h has the representation omework 2 1 Let X ad Y be ilbert spaces over C The a sesquiliear form h o X Y is a mappig h : X Y C such that for all x 1, x 2, x X, y 1, y 2, y Y ad all scalars α, β C we have (a) h(x 1 + x 2, y) h(x

More information

(A sequence also can be thought of as the list of function values attained for a function f :ℵ X, where f (n) = x n for n 1.) x 1 x N +k x N +4 x 3

(A sequence also can be thought of as the list of function values attained for a function f :ℵ X, where f (n) = x n for n 1.) x 1 x N +k x N +4 x 3 MATH 337 Sequeces Dr. Neal, WKU Let X be a metric space with distace fuctio d. We shall defie the geeral cocept of sequece ad limit i a metric space, the apply the results i particular to some special

More information

Beurling Integers: Part 2

Beurling Integers: Part 2 Beurlig Itegers: Part 2 Isomorphisms Devi Platt July 11, 2015 1 Prime Factorizatio Sequeces I the last article we itroduced the Beurlig geeralized itegers, which ca be represeted as a sequece of real umbers

More information

Introduction to Machine Learning DIS10

Introduction to Machine Learning DIS10 CS 189 Fall 017 Itroductio to Machie Learig DIS10 1 Fu with Lagrage Multipliers (a) Miimize the fuctio such that f (x,y) = x + y x + y = 3. Solutio: The Lagragia is: L(x,y,λ) = x + y + λ(x + y 3) Takig

More information

1 The Haar functions and the Brownian motion

1 The Haar functions and the Brownian motion 1 The Haar fuctios ad the Browia motio 1.1 The Haar fuctios ad their completeess The Haar fuctios The basic Haar fuctio is 1 if x < 1/2, ψx) = 1 if 1/2 x < 1, otherwise. 1.1) It has mea zero 1 ψx)dx =,

More information

Lecture 7: Density Estimation: k-nearest Neighbor and Basis Approach

Lecture 7: Density Estimation: k-nearest Neighbor and Basis Approach STAT 425: Itroductio to Noparametric Statistics Witer 28 Lecture 7: Desity Estimatio: k-nearest Neighbor ad Basis Approach Istructor: Ye-Chi Che Referece: Sectio 8.4 of All of Noparametric Statistics.

More information

Definitions and Theorems. where x are the decision variables. c, b, and a are constant coefficients.

Definitions and Theorems. where x are the decision variables. c, b, and a are constant coefficients. Defiitios ad Theorems Remember the scalar form of the liear programmig problem, Miimize, Subject to, f(x) = c i x i a 1i x i = b 1 a mi x i = b m x i 0 i = 1,2,, where x are the decisio variables. c, b,

More information

Lecture 19. sup y 1,..., yn B d n

Lecture 19. sup y 1,..., yn B d n STAT 06A: Polyomials of adom Variables Lecture date: Nov Lecture 19 Grothedieck s Iequality Scribe: Be Hough The scribes are based o a guest lecture by ya O Doell. I this lecture we prove Grothedieck s

More information

Lecture 6 Chi Square Distribution (χ 2 ) and Least Squares Fitting

Lecture 6 Chi Square Distribution (χ 2 ) and Least Squares Fitting Lecture 6 Chi Square Distributio (χ ) ad Least Squares Fittig Chi Square Distributio (χ ) Suppose: We have a set of measuremets {x 1, x, x }. We kow the true value of each x i (x t1, x t, x t ). We would

More information

CEE 522 Autumn Uncertainty Concepts for Geotechnical Engineering

CEE 522 Autumn Uncertainty Concepts for Geotechnical Engineering CEE 5 Autum 005 Ucertaity Cocepts for Geotechical Egieerig Basic Termiology Set A set is a collectio of (mutually exclusive) objects or evets. The sample space is the (collectively exhaustive) collectio

More information