arxiv: v1 [math.pr] 30 Mar 2015

DICTIONARY LEARNING WITH FEW SAMPLES AND MATRIX CONCENTRATION

KYLE LUH, Department of Mathematics, Yale University
VAN VU, Department of Mathematics, Yale University

Abstract. Let $A$ be an $n \times n$ matrix, $X$ be an $n \times p$ matrix and $Y = AX$. A challenging and important problem in data analysis, motivated by dictionary learning and other practical problems, is to recover both $A$ and $X$, given $Y$. Under normal circumstances, it is clear that this problem is underdetermined. However, in the case when $X$ is sparse and random, Spielman, Wang and Wright showed that one can recover both $A$ and $X$ efficiently from $Y$ with high probability, given that $p$ (the number of samples) is sufficiently large. Their method works for $p \ge Cn^2 \log^2 n$ and they conjectured that $p \ge Cn \log n$ suffices. The bound $n \log n$ is sharp for an obvious information theoretical reason. In this paper, we show that $p \ge Cn \log^4 n$ suffices, matching the conjectural bound up to a polylogarithmic factor. The core of our proof is a theorem concerning $\ell_1$ concentration of random matrices, which is of independent interest.

Our proof of the concentration result is based on two ideas. The first is an economical way to apply the union bound. The second is a refined version of Bernstein's concentration inequality for the sum of independent variables. Both have nothing to do with random matrices and are applicable in general settings.

E-mail addresses: kyle.luh@yale.edu, van.vu@yale.edu.
Key words and phrases. Dictionary learning, matrix concentration.

1. Introduction

Let $A$ be an $n \times n$ invertible matrix and $X$ be an $n \times p$ matrix; set $Y := AX$. The aim of this paper is to study the following recovery problem: Given $Y$, reconstruct $A$ and $X$.

It is clear that in the equation

(1.1) $Y = AX$,

we have $n^2 + np$ unknowns (the entries of $A$ and $X$), and only $np$ equations (given by the entries of $Y$). Thus, the problem is underdetermined and one cannot hope for a unique solution.

However, in practice, $X$ is frequently a sparse matrix. If $X$ is sparse, the number of unknowns decreases dramatically, as the majority of entries of $X$ are zero. The name of the game here is to find the minimum value of $p$, the number of observations, which guarantees a unique recovery (e.g. [2] and [6]).

One real-life application that motivates the study of this problem is dictionary learning. The matrix $A$ can be seen as a hidden dictionary, with its columns being the words. $X$ is a sparse sample matrix. This means that in the columns of $Y$ we observe linear combinations of a few columns of $A$. From these observations, we would like to recover the dictionary.

An archetypal example is facial recognition [18] [10]. A database of observed faces is used to generate the dictionary and once the dictionary is found, the problem of storing and transmitting facial images can be done very efficiently, as all one needs is to store and transmit a few coefficients. In fact, such dictionary-learning techniques can be utilized to recognize faces that are partially occluded or corrupted with noise [17]. For more discussion and real-life examples, we refer to [9], [12] and the references therein. Another practical situation in which the recovery problem appears essential is blind source separation and we refer the reader to [20] for more details.

There have been many approaches to efficient recovery beginning with the work of [12].
Let us mention, among others, online dictionary learning by [11], SIV [7], the relative Newton method for source separation by [19], the Method of Optimal Directions by [4], K-SVD in [1], and scalable variants in [11].

While various different approaches have been considered, there have not been many rigorous results concerning performance. The first such result was obtained by Spielman, Wang and Wright [15] concerning recovery with random samples; in other words, $X$ is a random sparse matrix. Before stating their result, we need to discuss the meaning of "unique" and the random model.

First, notice that if $Y = AX$, then $Y = (AV)(V^{-1}X)$ for any diagonal matrix $V$ with non-zero diagonal entries. Furthermore, one can freely permute the columns of $A$ and the rows of $X$ accordingly while keeping $Y$ the same. In the rest of the paper, unique recovery will be understood modulo these two operations.

To model $X$, one considers random Bernoulli-subgaussian matrices, defined as follows: $X$ is a matrix of size $n \times p$ with iid entries $x_{ij}$, where

(1.2) $x_{ij} := \chi_{ij}\xi_{ij}$,

where $\chi_{ij}$ are iid indicator random variables with $P(\chi_{ij} = 1) = \theta$ and $\xi_{ij}$ are iid random variables with mean 0, variance bounded by 1, $E|\xi| \in [1/10, 1]$, and $P(|\xi| \ge t) \le 2\exp(-t^2/2)$. This model includes many important distributions such as the standard Gaussians and Rademachers. The 1/10 is introduced for convenience of analysis and is not critical to the argument.

Spielman et al. proved

Theorem 1.1. There are constants $C > 0$, $C' > 0$ such that the following holds. Let $A$ be an invertible $n \times n$ matrix and $X$ a sparse random $n \times p$ matrix with $2/n \le \theta \le C'/\sqrt{n}$ and $\xi_{ij}$ having a symmetric distribution. Then for $p \ge Cn^2 \log^2 n$, one can efficiently find a solution with probability $1 - o(1)$.

Here and later, efficient means polynomial time. The algorithm designed for this purpose is called ER-SpUD, whose main subroutine is $\ell_1$ optimization. We are going to present and discuss this algorithm in Section 4.

In the dictionary learning problem, $p$ is the number of measurements, and it is important to optimize its value. From below, it is easy to see that we must have $p \ge cn \log n$ for some constant $c > 0$. Indeed, if $\theta = 2/n$ (or $c'/n$ for any constant $c'$) and $p < cn \log n$ for a sufficiently small constant $c$, then the coupon collector argument shows that with probability $1 - o(1)$, $X$ has an all-zero row. In this case, changing the corresponding column of $A$ will not affect $Y$, and a unique recovery is hopeless. Spielman et al. conjecture

Conjecture 1.2. There are constants $C > 0$, $\alpha > 0$ such that the following holds. Let $A$ be an invertible $n \times n$ matrix and $X$ a sparse random $n \times p$ matrix with $2/n \le \theta \le \alpha/\sqrt{n}$. Then for $p \ge Cn \log n$, one can efficiently find a solution with probability $1 - o(1)$.

As a matter of fact, they believe that ER-SpUD should perform well as long as $p \ge Cn \log n$, for some large constant $C$. They also proved that if one does not care about the running time of the algorithm, then $p \ge Cn \log n$ suffices.

The analysis in [15] boils down to the concentration problem. For a vector $v \in \mathbb{R}^n$, let $\mu_v := E\|X^T v\|_1$. Let $c$ be a small positive constant ($c = .1$ suffices) and let $Bad(v)$ be the event that $|\|X^T v\|_1 - \mu_v| \ge c\mu_v$. We want to have

(1.3) $P(\cup_{v \in \mathbb{R}^n} Bad(v)) = o(1)$.

In other words, with high probability, $\|X^T v\|_1$ does not deviate significantly from its mean, simultaneously for all $v \in \mathbb{R}^n$. One needs to find the smallest value of $p$ which guarantees (1.3). Notice that $\|X^T v\|_1$ is the sum of $p$ iid random variables $|X_i v|$ where $X_i$ are the rows of $X^T$ (that is, the columns of $X$).
Thus, intuitively the larger $p$ is, the more $\|X^T v\|_1$ concentrates. From below, we observe that (1.3) fails if $p \le n - 1$, since in this case for any matrix $X$ one can find a $v$ such that $X^T v = 0$ and $\mu_v \ge 1$ (we can take $v$ arbitrarily long). Spielman, Wang, and Wright [15] showed that $p \ge Cn^2 \log^2 n$ suffices. We will prove

Theorem 1.3. For any constant $c > 0$ there is a constant $C > 0$ such that (1.3) holds for any $p \ge Cn \log^4 n$.

Beyond the current application, Theorem 1.3 may be of independent interest for several reasons. While concentration inequalities for random matrices are abundant, most of them concern the spectral or $\ell_2$ norm. We have not seen one which addresses the $\ell_1$ norm as in this theorem. As sparsity plays a crucial role in data analysis, techniques involving the $\ell_1$ norm (such as $\ell_1$ optimization) become more and more important. Furthermore, in the proof we introduce two general ideas, which seem to be applicable in many settings. The first is an economical way to apply the union bound and the second is a refined version of Bernstein's concentration inequality for sums of independent variables.

Using Theorem 1.3, we are able to give an improved analysis of ER-SpUD, which yields

Theorem 1.4. There are constants $C > 0$, $C' > 0$ such that the following holds. Let $A$ be an invertible $n \times n$ matrix and $X$ a sparse random $n \times p$ matrix with $2/n \le \theta \le C'/\sqrt{n}$. Then for $p \ge Cn \log^4 n$, one can efficiently find a solution with probability $1 - o(1)$.

Our $p$ is within a $\log^3 n$ factor of the bound in Conjecture 1.2. Furthermore, we can drop the assumption from Theorem 1.1 that the $\xi_{ij}$ are symmetric.
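The concentration phenomenon behind (1.3) can be observed numerically for a single fixed vector. The following is a minimal sketch, not part of the paper: it assumes the Bernoulli-Rademacher case of model (1.2), and the function names `sample_column_dot` and `simulate` are hypothetical helpers introduced only for this illustration. It checks one $v$ at a time, whereas Theorem 1.3 concerns all $v$ simultaneously.

```python
import random


def sample_column_dot(v, theta, rng):
    """One draw of X_i^T v for a column X_i with entries chi * xi,
    chi ~ Bernoulli(theta) and xi a Rademacher sign (assumed model)."""
    total = 0.0
    for vj in v:
        if rng.random() < theta:                         # chi = 1
            total += vj if rng.random() < 0.5 else -vj   # xi = +1 or -1
    return total


def simulate(v, p, theta, trials, seed=0):
    """Draw `trials` independent matrices X (n x p) and return
    (empirical mean of ||X^T v||_1, largest relative deviation from it)."""
    rng = random.Random(seed)
    values = []
    for _ in range(trials):
        values.append(sum(abs(sample_column_dot(v, theta, rng)) for _ in range(p)))
    mean = sum(values) / trials
    spread = max(abs(x - mean) for x in values) / mean
    return mean, spread


if __name__ == "__main__":
    v = [1.0] + [0.0] * 9          # v = e_1, for which mu_v = p * theta
    for p in (200, 4000):
        mean, spread = simulate(v, p, theta=0.2, trials=10, seed=1)
        print(p, round(mean, 1), round(spread, 3))
```

For $v = e_1$ the quantity $\|X^T v\|_1$ is a binomial count with mean $p\theta$, so the relative deviation should shrink roughly like $1/\sqrt{p\theta}$ as $p$ grows.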

Next, we will be able to refine Theorem 1.3 in two ways. First, combining the proof of Theorem 1.4 with a result from random matrix theory, we obtain the following more general result, which handles the case when $A$ is rectangular.

Theorem 1.5. There are constants $C, \alpha > 0$ such that the following holds. Let $n > m$ and $A$ be an $n \times m$ matrix of rank $m$ and $X$ a sparse random $m \times p$ matrix with $2/n \le \theta \le \alpha/\sqrt{n}$. Then for $p \ge Cn \log^4 n$, one can efficiently find a solution with probability $1 - o(1)$.

Second, in the sparsest case $\theta := \Theta(1/n)$, we develop a new algorithm that obtains the optimal bound $p = Cn \log n$, proving Conjecture 1.2 in this regime.

Theorem 1.6. For any $c > 0$ there is a constant $C > 0$ such that the following holds. Let $A$ be an invertible $n \times n$ matrix and $X$ a sparse random $n \times p$ matrix with $\theta = c/n$. Then for $p \ge Cn \log n$, one can efficiently find a solution with probability $1 - o(1)$.

Finally, let us mention the issue of theoretical recovery, regardless of the running time. Without the complexity issue, Spielman et al. showed that $p \ge Cn \log n$ suffices, given that the random variable $\xi_{ij}$ in the definition of $X$ has a symmetric distribution. We can strengthen this theorem by removing this assumption.

Theorem 1.7. There are constants $C > 0$, $C' > 0$ such that the following holds. Let $A$ be an invertible $n \times n$ matrix and $X$ a sparse random $n \times p$ matrix with $2/n \le \theta \le C'/\sqrt{n}$. Then for $p \ge Cn \log n$, one can find a solution with probability $1 - o(1)$.

The rest of the paper is organized as follows. In Section 2, we present the main ideas behind the proof of Theorem 1.3. The details follow in Section 3. Section 4 contains the accompanying algorithms and an improved analysis of ER-SpUD, following [15]. Section 5 addresses a generalization to rectangular dictionaries. Section 6 introduces a new algorithm that achieves the optimal bound in the sparse regime. In Section 7, we prove Theorem 1.7. We conclude with Section 8, in which we present some numerical experiments of the various algorithms.

Acknowledgement. We would like to thank D.
Spielman for bringing the problem to our attention.

2. The main ideas and lemmas

2.1. The standard $\epsilon$-net argument. Let us recall our task. For a vector $v \in \mathbb{R}^n$, let $\mu_v := E\|X^T v\|_1$. Let $c$ be a small positive constant ($c = .1$ suffices) and let $Bad(v)$ be the event that $|\|X^T v\|_1 - \mu_v| \ge c\mu_v$. We want to show that if $p$ is sufficiently large, then

(2.1) $P(\cup_{v \in \mathbb{R}^n} Bad(v)) = o(1)$.

For the sake of presentation, let us assume that the random variables $\xi_{ij}$ are Rademacher (taking values $\pm 1$ with probability 1/2); the entries $x_{ij}$ of $X$ have the form $x_{ij} = \chi_{ij}\xi_{ij}$, where $\chi_{ij}$ are iid indicator variables with mean $\theta$.

We start with a quick proof of the bound $p \ge Cn^2 \log^2 n$ obtained in [15]. Notice that the union in (1.3) contains infinitely many terms. The standard way to handle this is to use an $\epsilon$-net argument.

Definition 2.1. A set $N \subset \mathbb{R}^n$ is an $\epsilon$-net of a set $D \subset \mathbb{R}^n$ in $\ell_q$ norm, for some $0 < q \le \infty$, if for any $x \in D$ there is $y \in N$ so that $\|x - y\|_q \le \epsilon$.

The unit sphere in $\ell_q$ norm consists of vectors $v$ where $\|v\|_q = 1$. $B$ denotes the unit sphere in $\ell_1$ norm. Considering the vectors in $B$ is sufficient to prove the result. It is easy to show that for any $v \in B$

$\mu_{\min} := p\sqrt{\theta/n} \le \mu_v \le p\theta =: \mu_{\max}$,

where the lower bound is attained at $v = \frac{1}{n}\mathbf{1}$ ($\mathbf{1}$ is the all-one vector) and the upper bound at $v = (1, 0, \dots, 0)$.

Let $N_0$ be the set of all vectors in $B$ whose coordinates are integer multiples of $n^{-3}$. Any vector in $B$ would be of distance at most $n^{-2}$ in $\ell_1$ norm from some vector in $N_0$ (thus $N_0$ is an $n^{-2}$-net of $B$). A short consideration shows that if $u, v \in B$ are within $n^{-2}$ of each other, then

$|\mu_v - \mu_u| = o(\mu_{\min})$.

Thus, to prove (1.3), it suffices to show that

(2.2) $P(\cup_{v \in N_0} Bad(v)) = o(1)$.

In order to bound $P(\cup_{v \in N_0} Bad(v))$, let us first bound $P(Bad(v))$ for any $v \in B$. Notice that

$\|X^T v\|_1 = \sum_{i=1}^p |X_i^T v|$,

where $X_i$ are the columns of $X$. The random variables $|X_i^T v|$ are iid, and one is poised to apply another standard tool, Bernstein's inequality for the sum of independent random variables.

Lemma 2.2. Let $Z_1, \dots, Z_n$ be independent random variables such that $|Z_i| \le \tau$ with probability 1. Let $S := \sum_{i=1}^n Z_i$. Then for any $T > 0$

$\max\{P(S - ES \ge T), P(S - ES \le -T)\} \le \exp\left(-\frac{T^2}{2(\mathrm{Var}\,S + \tau T)}\right) \le \exp\left(-\min\left\{\frac{T^2}{4\mathrm{Var}\,S}, \frac{T}{4\tau}\right\}\right)$.

In our case $Z_i = |X_i^T v| = |\sum_{j=1}^n x_{ji} v_j|$. As $|x_{ij}| = |\chi_{ij}\xi_{ij}| \le 1$ with probability 1 (we assume that $\xi_{ij}$ are Rademacher),

$|Z_i| \le \sum_{j=1}^n |v_j| = \|v\|_1 = 1$

with probability 1. This means we can set $\tau = 1$. Furthermore

$\mathrm{Var} \sum_{i=1}^p Z_i = p\,\mathrm{Var}\,Z_i \le pE|X_i^T v|^2 = p\sum_{j=1}^n \theta v_j^2 \le p\theta \sum_{j=1}^n |v_j| = p\theta$.

Finally, one can set $T = c\mu_{\min} = cp\sqrt{\theta/n}$. Lemma 2.2 implies that

$P(Bad(v)) \le 2\exp\left(-\min\left\{\frac{c^2 p^2 \theta/n}{4p\theta}, \frac{cp\sqrt{\theta/n}}{4}\right\}\right) = 2\exp\left(-\frac{c^2 p}{4n}\right)$,

since $\sqrt{\theta/n} \ge 1/n$ as $\theta \ge 1/n$. Using the union bound

(2.3) $P(\cup_{v \in N_0} Bad(v)) \le \sum_{v \in N_0} P(Bad(v))$

we obtain

$P(\cup_{v \in N_0} Bad(v)) \le |N_0| \cdot 2\exp\left(-\frac{c^2 p}{4n}\right)$.

It is easy to check that $|N_0| = \exp(\Omega(n \log n))$. So, in order to make the RHS $o(1)$, we need $p \ge Cn^2 \log n$ for a sufficiently large constant $C$. For the case when $\xi_{ij}$ are not Bernoulli (but still

subgaussian) the calculation in [15] requires an extra logarithm term, which results in the bound $p \ge Cn^2 \log^2 n$.

2.2. New ingredients. Our first idea is to find a more efficient variant of the union bound

$P(\cup_{v \in N_0} Bad(v)) \le \sum_{v \in N_0} P(Bad(v))$.

Motivated by the inclusion-exclusion formula, we try to capture some gain when $P(Bad(u) \cap Bad(v))$ is large for many pairs $u, v$. The idea is to group the elements $v$ of the net into clusters so that within each cluster, the events $Bad(v)$ (seen as subsets of the underlying probability space) are close to each other.

Assume, for a moment, that one can split the net $N_0$ into $m$ disjoint clusters $C_i$, $1 \le i \le m$, so that if $u$ and $v$ belong to the same cluster

$P(Bad(u) \setminus Bad(v)) \le p_1$,

where $p_1$ is much smaller than $p_0$, a uniform bound on $P(Bad(v))$. Then

$P(\cup_{v \in C_i} Bad(v)) \le P(Bad(v^{[i]})) + |C_i| p_1$,

where $v^{[i]}$ is a representative point in $C_i$. Summing over $i$, one obtains

(2.4) $P(\cup_{v \in N_0} Bad(v)) \le \sum_{i=1}^m P(\cup_{v \in C_i} Bad(v)) \le \sum_{i=1}^m P(Bad(v^{[i]})) + |N_0| p_1 \le mp_0 + |N_0| p_1$.

We gain significantly if $p_1$ is much smaller than $p_0$ and $m$ is much smaller than $|N_0|$. Next, viewing the set of representatives $v^{[i]}$ as a new net $N_1$, we can iterate the argument, obtaining the following lemma.

Lemma 2.3. Let $\mathcal{P}$ be a probability space. Let $N = N_0$ be a finite set, where to each element $v \in N_0$ we associate a set $Bad_0(v) \subset \mathcal{P}$. Assume that we can construct a sequence of sets $N_L, N_{L-1}, \dots, N_0$, and for each $u \in N_\ell$, $1 \le \ell \le L$, an event $Bad_\ell(u)$ such that the following holds. For each $v \in N_{\ell-1}$, there is $u \in N_\ell$ such that $P(Bad_{\ell-1}(v) \setminus Bad_\ell(u)) \le p_\ell$ and for each $u \in N_L$, $P(Bad_L(u)) \le p_0$. Then

(2.5) $P(\cup_{v \in N_0} Bad_0(v)) \le |N_L| p_0 + \sum_{\ell=1}^L |N_{\ell-1}| p_\ell$.

The construction of the $N_\ell$ is of critical importance, and we are going to construct them using the $\ell_\infty$ distance, rather than the obvious choice of $\ell_1$. This is the key point of our method.

The next main technical ingredient is a more efficient way of using Bernstein's inequality, Lemma 2.2. Recall the bound

(2.6) $P(|S - ES| \ge T) \le 2\exp\left(-\frac{T^2}{2(\mathrm{Var}\,S + \tau T)}\right) \le 2\exp\left(-\min\left\{\frac{T^2}{4\mathrm{Var}\,S}, \frac{T}{4\tau}\right\}\right)$.
The first term $\frac{T^2}{4\mathrm{Var}\,S}$ in the rightmost formula is usually optimal. However, we need to improve the second term. The idea is to replace $\tau$ with a smaller quantity $\tau'$ such that the probability that $|Z_i| \le \tau'$ is close to 1. Let us illustrate this idea with the upper tail. Set $\mu := ES$; we consider

$P(S \ge \mu + T)$.

Write

$Z_i := Z_i J_i + Z_i I_i$

where $J_i$ is the indicator of the event $|Z_i| \le \tau'$ and $I_i = 1 - J_i$. Thus

$S := \sum_i Z_i J_i + \sum_i Z_i I_i = Q + S(1)$.

Let $\mu_1 := EQ$ and $\mu_2 := ES(1)$, so that $\mu = \mu_1 + \mu_2$. Then

$P(S \ge \mu + T) \le P(Q \ge \mu_1 + T/2) + P(S(1) \ge \mu_2 + T/2)$.

We can use Lemma 2.2 to bound $P(Q \ge \mu_1 + T/2)$, which provides a bound better than (2.6) as now $\tau' < \tau$. On the other hand, if the probability that $|Z_i| \ge \tau'$ is small, then we can bound $P(S(1) \ge \mu_2 + T/2)$ in a different way, exploiting the fact that there will be very few non-zero summands in $S(1)$.

We can (and have to) further refine this idea by considering a sequence of $\tau'$, breaking $S$ into the sum of $Q$ and $S(k)$, $1 \le k \le M$, for a properly chosen $M$. This will be our leading idea to bound the difference probability $p_\ell$ in the next section.

On the abstract level, our method bears a similarity to the chaining argument from the theory of Banach spaces. We are going to discuss this point in Section 3.7.

3. Proof of Theorem 1.3

For the sake of presentation, we assume that $x_{ij} = \chi_{ij}\xi_{ij}$ where $\chi_{ij}$ are iid Bernoulli random variables with mean $\theta$ and $\xi_{ij}$ are iid Rademacher random variables. In fact, $p \ge Cn \log^3 n$ is sufficient for the Rademacher case. The proof can be easily modified for $\xi_{ij}$ being general subgaussian at the cost of a $\log n$ factor in the bound for $p$ (see Section 3.6).

We recall the notation $\mu_{\min} = p\sqrt{\theta/n}$, $\mu_{\max} = p\theta$; $\mu_v := E\|X^T v\|_1$. $B$ is the set of all vectors of unit $\ell_1$ norm. We set $p = Cn \log^3 n$, for a sufficiently large constant $C$. Let $T := \frac{c_0 \mu_{\min}}{\log n}$ for a small constant $c_0 > 0$ and $K := \frac{6\mu_{\max}}{T}$.

3.1. $\alpha$-nets in $\ell_\infty$ norm.

Lemma 3.1. For any $1 \ge \alpha \ge 2/n$, $B$ admits an $\alpha$-net in $\ell_\infty$ norm of size at most $\exp(2\alpha^{-1} \log n)$.

Proof. Let $N$ be the collection of all vectors $v$ with $\|v\|_1 \le 1$ whose coordinates are integer multiples of $\alpha$. Obviously, $N$ is an $\alpha$-net of $B$ in $\ell_\infty$ norm. Furthermore, any $v \in N$ satisfies $\|v\|_1 \le 1$, so it has at most $k := \alpha^{-1}$ non-zero coordinates. If a coordinate is non-zero, it can take at most $3k$ values. Therefore,

$|N| \le \sum_{i=0}^k \binom{n}{i} (3k)^k$.

As $\alpha \ge 2/n$, the RHS is at most

$n\binom{n}{k}(3k)^k \le n\left(\frac{en}{k}\right)^k (3k)^k = n(3en)^k \le \exp(2\alpha^{-1} \log n)$.
The key here is that we consider an $\alpha$-net in $\ell_\infty$ norm, rather than in $\ell_1$ norm, which appears to be the natural choice.

3.2. Building a nested sequence. Recall that $N_0$ is the set of vectors $v$ in $B$ whose coordinates are integer multiples of $n^{-3}$. We have

(3.1) $|N_0| \le (2n^3 + 1)^n \le \exp(4n \log n)$.

Consider the sequence $\alpha_0 = 2/n$; $\alpha_\ell = 2\alpha_{\ell-1}$ for $\ell = 1, \dots, L$, where $L \le \log_2 n$ is the first index such that $\alpha_L > 1/2$. Let $N'_\ell$ be an $\alpha_\ell$-net of $B$ in the $\ell_\infty$ norm. By Lemma 3.1, we can choose $N'_\ell$ such that

(3.2) $|N'_\ell| \le \exp(2\alpha_\ell^{-1} \log n)$.

We now build a nested sequence $N_L \subset N_{L-1} \subset \dots \subset N_1 \subset N_0$ as follows. Assume that $N_{\ell-1}$ has been built. Use the points in $N'_\ell$ as centers to construct a Voronoi partition of the points of $N_{\ell-1}$ with respect to the $\ell_\infty$ norm (ties are broken arbitrarily). For each point $u \in N'_\ell$, let $C_u$ be the subset of $N_{\ell-1}$ corresponding to $u$. By definition, $\|u - v\|_\infty \le \alpha_\ell$ for any $v \in C_u$.

Partition the interval $[\mu_{\min}, \mu_{\max}] = [p\sqrt{\theta/n}, p\theta]$ into $K$ intervals $I_1, \dots, I_K$ of equal lengths. We partition $C_u$ further into $K$ subsets $C_{u,j}$, $1 \le j \le K$, where $v \in C_{u,j}$ if $E\|X^T v\|_1 \in I_j$. By this construction, if $v, w$ belong to the same $C_{u,j}$, then by the definition of $K$, we have the key relations

(3.3) $\|v - w\|_\infty \le 2\alpha_\ell$ and $|E\|X^T v\|_1 - E\|X^T w\|_1| \le p\theta/K \le T/6$.

From each set $C_{u,j}$ choose an arbitrary element $v$. Thus, each $u \in N'_\ell$ gives rise to a set $R_u$ of $K$ elements ($R$ stands for representative). Define

$N_\ell := \cup_{u \in N'_\ell} R_u$.

It is clear that $N_\ell \subset N_{\ell-1}$ and

(3.4) $|N_\ell| \le K|N'_\ell| \le K\exp(2\alpha_\ell^{-1} \log n)$.

3.3. Bounding the differences. Consider the construction of $N_\ell$, $1 \le \ell \le L$, from Section 3.2. Let $v \in N_\ell$. Thus, $v \in C_{u,j}$ for some $u \in N'_\ell$ and $1 \le j \le K$. Consider another point $w \in C_{u,j}$. Our main task is to show

Lemma 3.2. For all pairs $v, w$ as above

(3.5) $\rho(v, w) := P(|\|X^T v\|_1 - \|X^T w\|_1| \ge T) \le \exp(-5\alpha_\ell^{-1} \log n)$.

The rest of this section is devoted to the proof of this lemma. By (3.3), we have

(3.6) $\|v - w\|_\infty \le 2\alpha_\ell$ and $|E\|X^T v\|_1 - E\|X^T w\|_1| \le p\theta/K \le T/6$.

Define $Z_i = |X_i v| - |X_i w|$, where $X_i$ is the $i$th row of $X^T$; we have

$\|X^T v\|_1 - \|X^T w\|_1 = \sum_{i=1}^p (|X_i v| - |X_i w|) = \sum_{i=1}^p Z_i$.

Set $S := \sum_{i=1}^p Z_i$; by symmetry, it suffices to bound

$P(Z_1 + \dots + Z_p \ge T) := P(S \ge T)$.

Notice that by the triangle inequality

$|Z_i| = ||X_i v| - |X_i w|| \le |X_i(v - w)|$.

Therefore,

$\mathrm{Var}\,Z_i \le EZ_i^2 \le E|X_i(v - w)|^2 = \theta \sum_{j=1}^n (v_j - w_j)^2$.

Recall that $\|v\|_1, \|w\|_1 \le 1$ and $\|v - w\|_\infty \le \alpha_\ell$. Therefore

$\sum_{j=1}^n (v_j - w_j)^2 \le \alpha_\ell \sum_{j=1}^n (|v_j| + |w_j|) = 2\alpha_\ell$.

This implies

(3.7) $\mathrm{Var}\,Z_i \le EZ_i^2 \le 2\alpha_\ell\theta$.

We denote by $I_{i,k}$ the event that $\tau_k < |Z_i| \le \tau_{k-1}$ for $k = 1, \dots, M$ and $J_i$ the event that $|Z_i| \le \tau_M$, for a sequence $\tau_k$, $k = 0, \dots, M$, where $\tau_0 = 2$; $\tau_i = 2^{-i}\tau_0$ and $M$ is the largest index so that

(3.8) $\min\left\{\frac{\tau_M^2}{8\alpha_\ell\theta}, \frac{\tau_M}{4\alpha_\ell}\right\} \ge 8 \log n$.

Note that if $\alpha_\ell \le \frac{1}{32}\log^{-1} n$ then such an index $M \ge 1$ exists. We will proceed with this assumption and cover the remaining cases at the end of the proof.

Apparently, $Z_i \le \sum_k |Z_i| I_{i,k} + Z_i J_i$. Set $S(k) = \sum_{i=1}^p |Z_i| I_{i,k}$ for $k = 1, \dots, M$ and $Q = \sum_{i=1}^p Z_i J_i$. We have

$P(S \ge T) \le P(Q \ge T/2) + \sum_{k=1}^M P\left(S(k) \ge \frac{T}{2M}\right)$.

To bound $P(Q \ge T/2)$, we notice that (see (3.11)) the choice of $\tau_M$ guarantees that $P(J_i) \ge 1 - 2n^{-8}$ for all $i = 1, \dots, p$. As $|Z_i| \le 2$ with probability 1, it follows that

$|EZ_i J_i - EZ_i| \le 4n^{-8}$

and so

$|EQ - ES| \le 4pn^{-8} = o(n^{-6})$,

as $p = o(n^2)$. On the other hand, by (3.6), $T \ge 5(|ES| + n^{-6})$. Thus

$P(Q \ge T/2) \le P(Q \ge EQ + T/4)$.

By definition, $Q$ is a sum of $p$ iid random variables, each bounded by $\tau_M$ in absolute value with probability 1. Furthermore, by (3.7)

$\mathrm{Var}\,Q = p\,\mathrm{Var}\,Z_1 J_1 \le pEZ_1^2 \le 2\alpha_\ell\theta p$.

By Lemma 2.2, we have

(3.9) $P(Q \ge EQ + T/4) \le 2\exp\left(-\min\left\{\frac{(T/4)^2}{8\alpha_\ell\theta p}, \frac{T/4}{4\tau_M}\right\}\right) = 2\exp\left(-\min\left\{\frac{T^2}{128\alpha_\ell\theta p}, \frac{T}{16\tau_M}\right\}\right)$.

Now we bound $P(S(k) \ge \frac{T}{2M})$, for $k = 1, \dots, M$. Recall that $S(k) := \sum_{i=1}^p |Z_i| I_{i,k}$ is a sum of iid non-negative random variables, each of which is either 0 or in $(\tau_k, \tau_{k-1}]$. Thus, if $S(k) \ge T/2M$ there

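must be at least $p_k := \frac{T}{2M\tau_{k-1}}$ indices $i$ such that $|Z_i| > \tau_k$. Let $\rho_k$ be the probability that $|Z_1| > \tau_k$. Then by the union bound and the fact that $p = o(n^2)$,

(3.10) $P\left(S(k) \ge \frac{T}{2M}\right) \le \binom{p}{p_k}\rho_k^{p_k} \le \left(\frac{ep}{p_k}\rho_k\right)^{p_k} \le \left(\frac{n^2}{2}\rho_k\right)^{p_k}$.

To complete the analysis, we need to estimate $\rho_k$. By definition

$\rho_k := P(||X_1 v| - |X_1 w|| > \tau_k) \le P(|X_1(v - w)| \ge \tau_k)$.

The random variable $\tilde{Z}_1 := X_1(v - w) = \sum_{j=1}^n \xi_j(v_j - w_j)$ has mean 0. Furthermore, by (3.7),

$\mathrm{Var}\,\tilde{Z}_1 = E\tilde{Z}_1^2 \le 2\alpha_\ell\theta$.

Finally, each term $\xi_j(v_j - w_j)$ is at most $\alpha_\ell$ in absolute value. Thus Lemma 2.2 implies

(3.11) $\rho_k \le P(|\tilde{Z}_1| \ge \tau_k) \le 2\exp\left(-\min\left\{\frac{\tau_k^2}{8\alpha_\ell\theta}, \frac{\tau_k}{4\alpha_\ell}\right\}\right)$.

This and (3.10) yield

(3.12) $P\left(S(k) \ge \frac{T}{2M}\right) \le 2\exp\left(\left(-\min\left\{\frac{\tau_k^2}{8\alpha_\ell\theta}, \frac{\tau_k}{4\alpha_\ell}\right\} + 2\log n\right)p_k\right)$.

By (3.8),

$\min\left\{\frac{\tau_k^2}{8\alpha_\ell\theta}, \frac{\tau_k}{4\alpha_\ell}\right\} \ge 8\log n$,

so

$\left(-\min\left\{\frac{\tau_k^2}{8\alpha_\ell\theta}, \frac{\tau_k}{4\alpha_\ell}\right\} + 2\log n\right)p_k \le -\frac{1}{2}\min\left\{\frac{\tau_k^2}{8\alpha_\ell\theta}p_k, \frac{\tau_k}{4\alpha_\ell}p_k\right\}$.

By definition $p_k = \frac{T}{2M\tau_{k-1}} = \frac{T}{4M\tau_k}$, as $\tau_{k-1} = 2\tau_k$. Therefore,

$\frac{1}{2}\cdot\frac{\tau_k^2}{8\alpha_\ell\theta}p_k = \frac{\tau_k T}{64M\alpha_\ell\theta}$ and $\frac{1}{2}\cdot\frac{\tau_k}{4\alpha_\ell}p_k = \frac{T}{32M\alpha_\ell}$.

By (3.9) and (3.12), we conclude that

(3.13) $P(S \ge T) \le 2\exp\left(-\min\left\{\frac{T^2}{128\alpha_\ell\theta p}, \frac{T}{16\tau_M}\right\}\right) + 2\sum_{k=1}^M \exp\left(-\min\left\{\frac{\tau_k T}{64M\alpha_\ell\theta}, \frac{T}{32M\alpha_\ell}\right\}\right)$.

A routine verification (see Section 3.5) shows that once $p \ge Cn \log^3 n$ for a sufficiently large constant $C$, the RHS in (3.13) is at most $\exp(-5\alpha_\ell^{-1} \log n)$, completing the proof for the case $\alpha_\ell \le \frac{1}{32}\log^{-1} n$.

To complete the proof, we now treat the remaining case when $\alpha_\ell \ge \frac{1}{32}\log^{-1} n$. In this case, we do not need to split $Z_i$. Recall $S = Z_1 + \dots + Z_p$ where $|Z_i| \le 2$ with probability 1, $|ES| \le T/6$ and $\mathrm{Var}\,S \le 2p\theta\alpha_\ell$. By Lemma 2.2, we have

$P(S \ge T) \le P(S \ge ES + T/2) \le \exp\left(-\min\left\{\frac{T^2}{8p\theta\alpha_\ell}, \frac{T}{8}\right\}\right)$.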

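By the analysis of (3.13), we already know that $\frac{T^2}{8p\theta\alpha_\ell} \ge 5\alpha_\ell^{-1} \log n$. On the other hand, as $\alpha_\ell \ge \frac{1}{32}\log^{-1} n$,

$\frac{T}{8} = \frac{c_0 p\sqrt{\theta/n}}{8\log n} = \frac{c_0 C}{8}\sqrt{\theta n}\,\log^2 n \ge 5\alpha_\ell^{-1} \log n$,

given that $c_0 C$ is sufficiently large. This completes the proof.

3.4. Proof of the concentration lemma. For $v \in N_\ell$, $0 \le \ell \le L$, let $Bad_\ell(v)$ be the event that $|\|X^T v\|_1 - \mu_v| \ge 2(L + 1 - \ell)T$. For $\ell = 0$,

$2(L + 1 - \ell)T = 2(L + 1)T \le \frac{2c_0(\log_2 n + 1)\mu_{\min}}{\log n} \le 4c_0\mu_{\min}$.

Thus,

$P(\cup_{v \in N_0} \{|\|X^T v\|_1 - \mu_v| \ge 4c_0\mu_{\min}\}) \le P(\cup_{v \in N_0} Bad_0(v))$.

Assume that there is a number $p_0$ such that $P(Bad_L(v)) \le p_0$ for all $v \in N_L$. Assume furthermore that for any $1 \le \ell \le L$, there is a number $p_\ell$ such that for $v \in N_\ell$ and $w \in N_{\ell-1}$, where $v$ is the representative of the set $C_{u,k}$ that contains $w$ (see the construction in Section 3.2),

$P(Bad_{\ell-1}(w) \setminus Bad_\ell(v)) \le p_\ell$.

Then by Lemma 2.3

$P(\cup_{v \in N_0} Bad_0(v)) \le |N_L| p_0 + \sum_{\ell=1}^L |N_{\ell-1}| p_\ell$.

To find $p_\ell$, notice that if $Bad_{\ell-1}(w)$ holds and $Bad_\ell(v)$ does not, then $|\|X^T w\|_1 - \mu_w| \ge 2(L + 2 - \ell)T$ and $|\|X^T v\|_1 - \mu_v| \le 2(L + 1 - \ell)T$. By (3.3), $|\mu_v - \mu_w| \le T$. It thus follows that $|\|X^T w\|_1 - \|X^T v\|_1| \ge T$. By the main lemma of Section 3.3, we know that the probability of this event is at most $p_\ell := \exp(-5\alpha_\ell^{-1} \log n)$, for all $\ell$.

Recall from Section 3.2 that

$|N_{\ell-1}| \le K\exp(2\alpha_{\ell-1}^{-1} \log n) = K\exp(4\alpha_\ell^{-1} \log n)$;

we have

$\sum_{\ell=1}^L |N_{\ell-1}| p_\ell \le \sum_{\ell=1}^L \exp(-5\alpha_\ell^{-1} \log n)\,K\exp(4\alpha_\ell^{-1} \log n)$.

Since $K = O(n^{1/2})$ and $\alpha_\ell^{-1} \ge 1$, the RHS is at most

$\sum_{\ell=1}^L \exp(-.5\alpha_\ell^{-1} \log n) = o(1)$.

To conclude, notice that by Lemma 2.2, we can set

$p_0 := 2\exp\left(-\min\left\{\frac{T^2}{8p\theta}, \frac{T}{8}\right\}\right)$.

As $|N_L| \le \exp(2\alpha_L^{-1} \log n) \le \exp(4 \log n)$ since $\alpha_L \ge 1/2$, we have $p_0|N_L| = o(1)$, as long as $\min\{\frac{T^2}{8p\theta}, \frac{T}{8}\} \ge 5 \log n$. This condition holds if $p \ge Cn \log^3 n$ for a sufficiently large constant $C$. This implies that

$P(\cup_{v \in N_0} \{|\|X^T v\|_1 - \mu_v| \ge 4c_0\mu_{\min}\}) = o(1)$,

and we are done by (2.2).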

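3.5. The magnitude of $p$. We present the routine verification concerning the exponents in (3.13). This is the only place where the magnitude of $p$ matters. Recall that $T = \frac{c_0\mu_{\min}}{\log n} = \frac{c_0 p\sqrt{\theta/n}}{\log n}$ and $p = Cn \log^3 n$ (since for the sake of exposition we are only considering the Rademacher case). We have

$\frac{T^2}{128\alpha_\ell\theta p} = \frac{c_0^2 p^2\theta/n}{128\theta p \log^2 n}\alpha_\ell^{-1} = \frac{c_0^2 p}{128 n \log^2 n}\alpha_\ell^{-1} = \frac{c_0^2 C}{128}\alpha_\ell^{-1} \log n \ge 4.1\alpha_\ell^{-1} \log n$,

provided that $c_0^2 C/128 \ge 4.1$.

By the definition of $M$ in (3.8), we have

$32 \log n \ge \min\left\{\frac{\tau_M^2}{8\alpha_\ell\theta}, \frac{\tau_M}{4\alpha_\ell}\right\} \ge 8 \log n$.

This implies that

$\tau_M \le \max\{16\sqrt{\alpha_\ell\theta \log n}, 128\alpha_\ell \log n\}$.

It follows that

$\frac{T}{16\tau_M} \ge \min\left\{\frac{T}{256\sqrt{\alpha_\ell\theta \log n}}, \frac{T}{2048\alpha_\ell \log n}\right\}$.

By the definition of $T$ and $p$,

$\frac{T}{256\sqrt{\alpha_\ell\theta \log n}} = \frac{c_0 p}{256\sqrt{\alpha_\ell n}\,\log^{3/2} n} = \alpha_\ell^{-1}\frac{c_0 C}{256}\sqrt{n\alpha_\ell}\,\log^{3/2} n \ge 4.1\alpha_\ell^{-1} \log n$,

since $c_0 C/256 \ge 4.1$ and $n\alpha_\ell \ge n\alpha_0 = 2 > 1$. Furthermore,

$\frac{T}{2048\alpha_\ell \log n} = \frac{c_0 Cn \log^3 n\sqrt{\theta/n}}{2048\alpha_\ell \log^2 n} = \alpha_\ell^{-1}\frac{c_0 C}{2048}\sqrt{\theta n}\,\log n \ge 4.1\alpha_\ell^{-1} \log n$,

provided that $c_0 C/2048 \ge 4.1$, since $\theta n \ge 1$.

Next, we bound the exponent $\frac{T}{32M\alpha_\ell}$. As $M \le \log n$, we have

$\frac{T}{32M\alpha_\ell} \ge \frac{c_0 Cn \log^2 n\sqrt{\theta/n}}{32\alpha_\ell \log n} = \alpha_\ell^{-1}\frac{c_0 C}{32}\sqrt{\theta n}\,\log n \ge 4.1\alpha_\ell^{-1} \log n$,

provided that $c_0 C/32 \ge 4.1$, since $\theta n \ge 1$.

Finally, we bound the exponent $\frac{\tau_k T}{64M\alpha_\ell\theta}$. By definition $\frac{\tau_k^2}{8\alpha_\ell\theta} \ge 8 \log n$, so $\tau_k \ge 8\sqrt{\alpha_\ell\theta \log n}$, and $M \le \log n$; thus

$\frac{\tau_k T}{64M\alpha_\ell\theta} \ge \frac{8\sqrt{\alpha_\ell\theta \log n}\;T}{64\alpha_\ell\theta \log n} = \alpha_\ell^{-1}\frac{c_0 C}{8}\sqrt{n\alpha_\ell}\,\log^{3/2} n = \omega(\alpha_\ell^{-1} \log n)$,

concluding the proof.

3.6. Extension from Rademacher to general sub-gaussian variables. We introduce the truncation operator $T_\tau : \mathbb{R}^{n \times p} \to \mathbb{R}^{n \times p}$ as

$(T_\tau[M])_{ij} = M_{ij}$ if $|M_{ij}| \le \tau$, and $0$ otherwise.

Let $\tau = C\sqrt{\log n}$ and let $\tilde{X} = T_\tau[X]$. For $C$ sufficiently large, the probability that $\tilde{X} = X$ is $1 - o(1)$. This allows us to work with a random matrix whose entries are bounded by $\tau$ (instead of 1 as in the Rademacher case). The same proof will go through if we increase $p$ by a factor of $C_1\tau$, for a sufficiently large constant $C_1$. This means $p = O(n \log^{3.5} n)$ suffices. We round 3.5 up to 4 for cosmetic reasons.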
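The truncation operator of Section 3.6 is simple enough to state in code. The sketch below is only an illustration under assumptions: matrices are plain lists of rows, the helper names `truncate` and `truncation_level` are hypothetical, and the constant in $\tau = C\sqrt{\log n}$ is left as a parameter because the paper only requires it to be sufficiently large.

```python
import math


def truncate(matrix, tau):
    """Entrywise truncation: keep entries with |M_ij| <= tau, zero out the rest."""
    return [[x if abs(x) <= tau else 0.0 for x in row] for row in matrix]


def truncation_level(n, constant=2.0):
    """tau = C * sqrt(log n); the default C = 2.0 is an arbitrary illustrative choice."""
    return constant * math.sqrt(math.log(n))


if __name__ == "__main__":
    M = [[0.5, -3.0], [2.0, 1.0]]
    print(truncate(M, 2.0))          # the entry -3.0 exceeds tau = 2.0 and is zeroed
    print(truncation_level(1000))
```

Because sub-gaussian entries exceed $C\sqrt{\log n}$ only with polynomially small probability, truncation at this level leaves the matrix unchanged with probability $1 - o(1)$, which is the only property the argument uses.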

3.7. Concluding remarks. There is a connection between the method of our proof and Fernique's chaining argument [5] (see [16] for a survey). The goal of the chaining method is to bound the supremum $\sup_{t \in B} X_t$ where $B$ is a domain in a metric space and $X_t$ is a Gaussian process. In this case, the bad event $Bad(v)$ can roughly be defined as $X_v \ge M_v$, for some candidate value $M_v$. One then considers a chain of sets in order to bound $P(\cup_{v \in B} Bad(v))$. This, in spirit, is similar to the purpose of Lemma 2.3.

After this, the arguments become different in all aspects. First, in our setting, the bad event $Bad(v)$ can have any nature. Next, in the chaining argument, the sets $N_j$ are defined using the metric of $B$, while in our case, it is crucial to use a different metric. We construct $N_j$ using the $\ell_\infty$ norm, rather than the natural $\ell_1$ norm used to define the domain $B$. Finally, in the chaining case it is easy to bound $P(Bad(u) \setminus Bad(v))$, using the fact that

$P(|X_u - X_v| \ge t) \le 2\exp\left(-\frac{t^2}{\mathrm{dist}(u,v)^2}\right)$,

which is the basic property of a Gaussian process. In our case, bounding $P(Bad(u) \setminus Bad(v))$ is an essential step (Lemma 3.2), which requires the development of the refined Bernstein's inequality.

4. The algorithm and concentration of random matrices

As the algorithm and analysis are discussed extensively in [15], we will be brief and the readers can consult [15] for more details. [15] introduces the dictionary learning algorithm ER-SpUD. The key insight in the design of ER-SpUD is that the rows of $X$ are likely to be the sparsest vectors in the row space of $Y$. (This observation also appeared in [20] and [11].) [15] proposed to find these vectors by considering the following optimization problem:

minimize $\|w^T Y\|_1$ subject to $r^T w = 1$,

where $r$ is the sum of two columns of $Y$. Using $\ell_1$ optimization for finding sparse vectors is a natural idea, and the authors of [15] pointed out that such an approach was already proposed in [13] and [8]. The difference is the new constraint $r^T w = 1$.
(Earlier works used different constraints.) By a change of variables $z = A^T w$, $b = A^{-1} r$, we can consider the equivalent problem

(4.1) minimize $\|z^T X\|_1$ subject to $b^T z = 1$.

The algorithm presented in [15] is outlined below (for those familiar with [15], note that we are presenting the two-column version of ER-SpUD):

Algorithm 1 ER-SpUD
1: Randomly pair the columns of $Y$ into $p/2$ groups $g_j = \{Ye_{j_1}, Ye_{j_2}\}$.
2: For $j = 1, \dots, p/2$:
   Let $r_j = Ye_{j_1} + Ye_{j_2}$, where $g_j = \{Ye_{j_1}, Ye_{j_2}\}$.
   Solve $\min_w \|w^T Y\|_1$ subject to $r_j^T w = 1$, and set $s_j = w^T Y$.
3: Use the Greedy algorithm to reconstruct $X$ and $A$.

Algorithm 2 Greedy
1: Require: $S = \{s_1, \dots, s_T\} \subset \mathbb{R}^p$
2: For $i = 1, \dots, n$:
   REPEAT
     $l \leftarrow \arg\min_{s_l \in S} \|s_l\|_0$, breaking ties arbitrarily
     $x_i = s_l$
     $S = S \setminus \{s_l\}$
   UNTIL $\mathrm{rank}([x_1, \dots, x_i]) = i$
3: Set $X = [x_1, \dots, x_n]^T$, and $A = YY^T(XY^T)^{-1}$.
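The selection loop of Algorithm 2 can be sketched as follows. This is an illustration under assumptions rather than the implementation of [15]: candidates are lists of integers or exact rationals, rank is computed by exact Gaussian elimination over fractions (a floating-point version would need a tolerance), and the final step $A = YY^T(XY^T)^{-1}$ is omitted. The names `rank` and `greedy` are hypothetical. Sorting once by sparsity and then scanning is equivalent to the REPEAT/UNTIL loop, since rejected candidates are never reconsidered.

```python
from fractions import Fraction


def rank(rows):
    """Rank of a list of equal-length rows, by exact Gaussian elimination."""
    m = [[Fraction(x) for x in r] for r in rows]
    rk = 0
    ncols = len(m[0]) if m else 0
    for c in range(ncols):
        # Find a pivot in column c among the rows not yet used as pivots.
        piv = next((i for i in range(rk, len(m)) if m[i][c] != 0), None)
        if piv is None:
            continue
        m[rk], m[piv] = m[piv], m[rk]
        for i in range(rk + 1, len(m)):
            f = m[i][c] / m[rk][c]
            if f != 0:
                m[i] = [a - f * b for a, b in zip(m[i], m[rk])]
        rk += 1
    return rk


def greedy(candidates, n):
    """Return n linearly independent candidates, taking the sparsest first."""
    pool = sorted(candidates, key=lambda s: sum(1 for x in s if x != 0))
    chosen = []
    for s in pool:
        if rank(chosen + [s]) == len(chosen) + 1:   # s increases the rank
            chosen.append(s)
            if len(chosen) == n:
                return chosen
    raise ValueError("fewer than n linearly independent candidates")


if __name__ == "__main__":
    S = [[1, 1, 0], [1, 0, 0], [2, 0, 0], [0, 1, 1], [0, 0, 3]]
    print(greedy(S, 3))
```

In the example, the candidate `[2, 0, 0]` is discarded because it is a multiple of the already chosen `[1, 0, 0]`, mirroring the rank test in step 2.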

A key technical step in analyzing ER-SpUD is the following lemma, which asserts that if $p$ is sufficiently large, then with high probability $\|X^T v\|_1$ is close to its mean, simultaneously for all unit vectors $v \in \mathbb{R}^n$.

Lemma 4.1. For every constant $1 \ge \delta > 0$ there is a constant $C_0 > 0$ such that the following holds. If $\theta \ge \frac{1}{n}$ and $p \ge C_0 n^2 \log^2 n$, then with probability $1 - o(1)$, for all $v \in \mathbb{R}^n$

(4.2) $|\|X^T v\|_1 - E\|X^T v\|_1| \le \delta E\|X^T v\|_1$.

This lemma appears implicitly in [15]. Dan Spielman pointed out to us that this would imply the critical [15, Lemma 17]. The bound $p \ge Cn^2 \log^2 n$ is of importance in the proof of this lemma. Our Theorem 1.3, which pushes $p$ to $Cn \log^4 n$, is an improved version of Lemma 4.1.

With Theorem 1.3 in hand, let us now sketch the proof of Theorem 1.4, following the analysis in [15]. Notice that if the solution of the $\ell_1$ optimization problem, $z_*$, is 1-sparse, then the algorithm will recover a row of $X$. The proof of the theorem relies on showing that $z_*$ is supported on the non-zero indices of $b$ and that with high probability, $z_*$ is in fact 1-sparse. The first goal allows us to focus our attention on a submatrix of $X$, which will be convenient for technical reasons. To address this first issue, we prove the following.

Lemma 4.2. Suppose that $X$ satisfies the Bernoulli-Subgaussian model. There exists a numerical constant $C > 0$ such that if $\theta n \ge 2$ and $p > Cn \log^4 n$ then the random matrix $X$ has the following property with probability at least $1 - o(1)$.

(P1) For every $b$ satisfying $\|b\|_0 \le 1/(8\theta)$, any solution $z_*$ to the optimization problem (4.1) has $\mathrm{supp}(z_*) \subseteq \mathrm{supp}(b)$.

Sketch of the proof of Lemma 4.2. We let $J$ be the indices of the $s$ non-zero entries of $b$. Let $S$ be the indices of the nonzero columns in $X_J$, and let $z_0 = P_J z_*$ (the restriction to those coordinates indexed by $J$). Define $z_1 = z_* - z_0$. We demonstrate that $z_0$ has at least as low an objective as $z_*$, so $z_1$ must be zero. One can show using the triangle inequality that

$\|z_*^T X\|_1 \ge \|z_0^T X\|_1 - 2\|z_1^T X^S\|_1 + \|z_1^T X\|_1$.
hus, if z 1 X 1 2 z 1 XS 1 > 0, then z 0 has a ower objective vaue. We need this inequaity to hod for a z with high probabiity. Notice that E[ z X 1 2 z X S 1 ] = (p 2 S )E z X 1 It is easy to show that S < p/4 with high probabiity so (p 2 S ) > 0 with high probabiity. herefore, if we can show that z X 1 2 z X S 1 is concentrated near its positive expectation we are done. We see that it suffices to show the resut for the worst case S = p/4. Now we make critica use of heorem 1.3, which asserts that with high probabiity, and so z X E z X 1 = 5p 8 E z X 1. z X S E z X S 1 = p 8 E z X 1. z X 1 2 z X S 1 p 2 E z X 1 > 0. Having proved Lemma 4.2, the rest of the proof is reativey simpe and foows [15] exacty. he success of the agorithm now depends on the existence of a sufficient gap between the argest and

second largest entry in $b$. The intuition is that if $X$ preserved the $\ell_1$ norm exactly, i.e. $\|z^T X\|_1 = c\|z\|_1$, then the minimization procedure would output the vector $z_*$ of smallest $\ell_1$ norm such that $b^T z = 1$, which is just $e_{j_*}/b_{j_*}$, where $j_*$ is the index of the element of $b$ with the largest magnitude. However, $X$ only preserves the $\ell_1$ norm in an approximate sense. Yet, the algorithm will still extract a row of $X$ if there is a significant gap between the largest element of $b$ and the second largest.

5. Rectangular dictionaries and Theorem 1.5

We now present a generalization of ER-SpUD, which enables us to deal with a rectangular dictionary. Consider a full rank matrix $A$ of size $n \times m$, such that $n > m$, and the equation $AX = Y$. To deal with this setting, we first augment $A$ to be a square, $n \times n$, invertible matrix. Of course, the issue is that one does not know $A$, and one also needs to figure out how the augmentation changes the product $Y$.

We can solve this issue using a random augmentation. For instance, we can use an $n \times (n - m)$ gaussian matrix $B$ to augment $A$ to a square matrix $A'$ (the entries in $B$ are iid standard gaussian). It is trivial that the augmented matrix has full rank with probability 1, since the probability that a gaussian vector belongs to any fixed hyperplane is zero. We can also augment $X$ from an $m \times p$ matrix to an $n \times p$ matrix $X'$ by an $(n - m) \times p$ random matrix $Z$ with entries iid to those of $X$. This augmentation process yields a matrix equation $Y' = A'X'$ where $Y' = Y + E$ and $E = BZ$ (Figure 1).

In practice, we can first generate $B$, $Z$, then compute $E := BZ$ and construct $Y' := Y + E$. Next, we apply the ER-SpUD algorithm to the equation $Y' = A'X'$ to recover $A'$ and $X'$ with high probability. From these two matrices, we can then deduce $A$ and $X$.

Using a gaussian (or any continuous) augmentation is convenient, as the resulting matrix is obviously full rank. However, it is, in some way, a cheat.
A Gaussian number does not have a finite representation, so it takes forever to read the input, let alone process it. A common practice is to truncate (as a matter of fact, the computer only generates a finite approximation of the Gaussian numbers anyway), and hope that the truncation is fine for our purpose. But then we face a non-trivial theoretical question in analyzing this approximation. How many decimal places are enough? Even if we can prove a guarantee here, using it in practice would require computing with a matrix with many long entries, which significantly increases the running time. We can avoid this problem by using random matrices with discrete distributions, such as $\pm 1$. The technical issue now is to prove the full rank property. This is a highly non-trivial problem, but luckily it was taken care of in the following result of Bourgain, Vu, and Wood [3].

Theorem 5.1. For every $\epsilon > 0$ there exists $\delta > 0$ such that the following holds. Let $N_{f,n}$ be an $n$ by $n$ complex matrix in which $f$ rows contain fixed, non-random entries and where the other rows contain entries that are independent discrete random variables. If the fixed rows have co-rank $k$ and if for every random entry $\alpha$ we have $\max_x P(\alpha = x) \le 1 - \epsilon$, then for all sufficiently large $n$,
$$P(N_{f,n} \text{ has co-rank} > k) \le (1-\delta)^{n-f}.$$

Letting $k = 0$ and $f = m$, the result shows that if we augment $A$ by an $n \times (n-m)$ random Bernoulli matrix, the new square matrix will be nonsingular with high probability, given that $n - m = \omega(1)$. We summarize our reasoning in the following algorithm (Algorithm 3).
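As a numerical sanity check on the augmentation described in this section, the following minimal Python sketch builds $B$ and $Z$, forms $Y + BZ$, and verifies that it equals the product of the block matrices $[A \; B]$ and $[X; Z]$. The function name `augment`, the Bernoulli-Rademacher model used for $X$ and $Z$, and the toy sizes are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def augment(Y, n, m, p, theta, rng):
    """Return (Y_aug, B, Z) with Y_aug = Y + B @ Z (hypothetical helper)."""
    # B: n x (n - m) Rademacher block, appended to A as extra columns.
    B = rng.choice([-1.0, 1.0], size=(n, n - m))
    # Z: (n - m) x p block drawn from the model assumed here for X:
    # Bernoulli(theta) support with Rademacher values.
    support = rng.random((n - m, p)) < theta
    Z = rng.choice([-1.0, 1.0], size=(n - m, p)) * support
    return Y + B @ Z, B, Z

rng = np.random.default_rng(0)
n, m, p, theta = 6, 4, 40, 0.3

# Toy rectangular dictionary and sparse sample matrix (assumed model).
A = rng.standard_normal((n, m))
X = rng.choice([-1.0, 1.0], size=(m, p)) * (rng.random((m, p)) < theta)
Y = A @ X

Y_aug, B, Z = augment(Y, n, m, p, theta, rng)
A_aug = np.hstack([A, B])   # square n x n dictionary [A B]
X_aug = np.vstack([X, Z])   # n x p sample matrix [X; Z]

# Block multiplication: [A B] [X; Z] = A X + B Z.
identity_holds = np.allclose(Y_aug, A_aug @ X_aug)
print(identity_holds, np.linalg.matrix_rank(A_aug))
```

Only $Y$ is needed to form the augmented observation, since $B$ and $Z$ are generated by the user; the check against `A_aug @ X_aug` is possible here only because the toy example knows $A$ and $X$.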

Figure 1. Rectangular $A$ with $n > m$.

Algorithm 3 Rectangular Algorithm
1: Generate an $(n-m) \times p$ matrix $Z$ with iid random entries that agree with the model for $X$.
2: Generate an $n \times (n-m)$ matrix $B$ with iid entries (either Gaussian or Rademacher).
3: Run ER-SpUD on the augmented matrix $Y + BZ$.
4: From the output of ER-SpUD, remove the columns of the recovered dictionary that correspond to $B$ and the rows of the recovered sample matrix that correspond to $Z$.

6. Optimal bound for very sparse random matrices

In this section, we discuss Theorem 1.6. We present a simple algorithm (see below) and use this algorithm to prove Theorem 1.6, obtaining the optimal bound $p = Cn \log n$.

Algorithm 4 Very-sparse Algorithm
1: Partition the columns of $Y$ into a minimum number of groups $G_i$ whose members are multiples of each other.
2: Choose representatives of those $G_i$ with more than two members to be the columns of $A$ up to scaling.

Proof of Theorem 1.6. Since $A$ is nonsingular, any two columns of $Y$ that are multiples of each other must be linear combinations of the same columns of $A$. For a group $G_i$ to have more than two members, there must be more than two columns of $X$ with their non-zero entries in the same rows.

Definition 6.1. We say that a set of columns is aligned if each column has more than one non-zero entry and their non-zero entries occur in the same positions.

Lemma 6.2. The probability that $X$ has more than two aligned columns is $o(1)$.

Thus, the algorithm is likely to yield only columns of $A$. We now need to show that all the columns of $A$ will be output with high probability.

Definition 6.3. We say the column $a$ of $A$ is $k$-represented if some group $G_i$ consists of multiples of $a$ and $|G_i| = k$. In particular, if no multiple of the $j$th column, $a_j$, shows up in the columns of $Y$, then $a_j$ is 0-represented. A column is well represented if it is $k$-represented for some $k > 2$.

Notice that the algorithm will output a multiple of every column that is well represented. The following lemma finishes the proof of Theorem 1.6.

Lemma 6.4. The probability that every column $a_i$ is well represented is $1 - o(1)$.
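The two steps of Algorithm 4 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name `very_sparse_recover` and the tolerance `tol` are hypothetical, columns are compared in floating point (two columns count as multiples when the cosine of their angle is within `tol` of $\pm 1$), and the toy $X$ is built by hand so that every column of $A$ is 3-represented.

```python
import numpy as np

def very_sparse_recover(Y, tol=1e-8):
    """Group columns of Y that are scalar multiples of each other and return
    one unit-norm representative per group with more than two members."""
    groups = []  # list of (unit direction, list of column indices)
    for j in range(Y.shape[1]):
        y = Y[:, j]
        norm = np.linalg.norm(y)
        if norm < tol:
            continue  # a zero column is a multiple of everything; skip it
        u = y / norm
        for direction, members in groups:
            # Unit vectors are parallel exactly when |<direction, u>| = 1.
            if abs(direction @ u) > 1.0 - tol:
                members.append(j)
                break
        else:
            groups.append((u, [j]))
    reps = [d for d, members in groups if len(members) > 2]
    if not reps:
        return np.empty((Y.shape[0], 0))
    return np.column_stack(reps)

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))
I = np.eye(n)

# Toy X: three multiples of each e_i (so each a_i is 3-represented),
# plus two 2-sparse columns that form groups of size one.
cols = [s * I[:, i] for i in range(n) for s in (1.0, -2.0, 3.0)]
cols += [I[:, 0] + I[:, 1], I[:, 2] - I[:, 3]]
X = np.column_stack(cols)
Y = A @ X

A_hat = very_sparse_recover(Y)
print(A_hat.shape)
```

Each returned column is a unit-norm multiple of a column of $A$, which matches the "up to scaling" guarantee in step 2. The pairwise comparison is quadratic in the number of groups; it is meant to mirror the algorithm statement, not to be efficient.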

6.1. Proofs of Sparse Algorithm.

Proof of Lemma 6.2. Given the choice of $\theta$, we know that the number of nonzero entries in any column of $X$ converges to the Poisson distribution. We ignore the $o(1/n)$ error terms from this approximation in later calculations to alleviate clutter. To calculate the probability, we condition on the number of nonzero entries, then bound the probability that three specific columns have the required property, and finally use the union bound. This yields an upper bound of
$$\binom{n}{3} \sum_{k \ge 2} \frac{e^{-3c}}{(k!)^3} \cdot \frac{1}{\binom{n}{k}^2} = o(1).$$

Proof of Lemma 6.4. By the union bound,
$$P(\exists\, i \text{ such that } a_i \text{ is not well represented}) \le n\,P(a_1 \text{ is not well represented}).$$
Partitioning into disjoint events yields
$$P(a_1 \text{ is not well represented}) = \sum_{j=0}^{2} P(a_1 \text{ is } j\text{-represented}).$$
Notice that a multiple of $a_1$, say $a\,a_1$ for a scalar $a$, appears as a column of $Y$ if and only if $a e_1 = (a, 0, 0, \ldots, 0)$, with $a \neq 0$, is $X_j$, the $j$th column of $X$, for some $j$. Now, using the Poisson approximation we can bound each term in the sum. For example, for the probability of being 0-represented, we can divide into the case that $X_i$ does not have exactly one non-zero element and the case that $X_i$ has exactly one non-zero term but not in the first row. We use $C$ to indicate an absolute constant which may change with each appearance.
$$P(a_1 \text{ is 0-represented}) \le \left((1 - ce^{-c}) + ce^{-c}\,\frac{n-1}{n}\right)^{p} \le C\exp(-Cp/n).$$
Similarly,
$$P(a_1 \text{ is 1-represented}) \le n\left(\frac{ce^{-c}}{n}\right)\left((1 - ce^{-c}) + ce^{-c}\,\frac{n-1}{n}\right)^{p-1} \le C\exp(-Cp/n)$$
and
$$P(a_1 \text{ is 2-represented}) \le \binom{n}{2}\left(\frac{ce^{-c}}{n}\right)^{2}\left((1 - ce^{-c}) + ce^{-c}\,\frac{n-1}{n}\right)^{p-2} \le C\exp(-Cp/n).$$
Thus,
$$n\,P(a_1 \text{ is not well represented}) \le C\exp(\log n - Cp/n) = o(1)$$
for $p = C' n \log n$ with $C'$ large enough.

7. Proof of Theorem 1.7

7.1. Lemmas Independent of Symmetry. We first state the necessary lemmas from [15] whose proofs do not use the symmetry of the random variables.

Lemma 7.1. If $\mathrm{rank}(X) = n$, $A$ is nonsingular, and $Y$ can be decomposed into $Y = A'X'$, then the row spaces of $X'$, $X$, and $Y$ are the same.
The general idea is to show that the sparsest vectors in the row-span of $Y$ are the rows of $X$. Since all of the rows of $X'$ lie in the row-span of $Y$, intuitively, they can be sparse only when they are multiples of the rows of $X$. Naively, this is because rows of $X$ are likely to have nearly disjoint supports. Thus, any linear combination of them will probably increase the number of nonzero entries.
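The heuristic above is easy to observe numerically. The sketch below, written under assumed parameters (a Bernoulli($\theta$)-Rademacher matrix with $n = 20$, $p = 2000$, $\theta = 0.05$, none of which come from the paper), compares the number of nonzeros in two rows of $X$ with the number of nonzeros in a linear combination of them. The coefficients 1 and 2 are chosen so that $\pm 1$ entries can never cancel, which makes the support of the combination exactly the union of the two supports.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, theta = 20, 2000, 0.05  # illustrative sizes, not from the paper

# Bernoulli(theta) support with Rademacher (+1/-1) values.
support = rng.random((n, p)) < theta
X = rng.choice([-1.0, 1.0], size=(n, p)) * support

row_nnz = np.count_nonzero(X, axis=1)

# Combine two rows; since |1 * x| != |2 * y| for x, y in {+1, -1},
# no entry cancels and the support is the union of both supports.
combo = X[0] + 2.0 * X[1]
combo_nnz = np.count_nonzero(combo)
union_nnz = np.count_nonzero((X[0] != 0) | (X[1] != 0))

print(row_nnz[0], row_nnz[1], combo_nnz)
```

Because the two supports overlap in only about $\theta^2 p$ positions on average, the combination has close to twice as many nonzeros as either row, which is the gap that Lemma 7.7 quantifies.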

Lemma 7.2. Let $\Omega$ be an $n \times p$ Bernoulli($\theta$) matrix with $1/n < \theta < 1/4$. For each set $S \subseteq [n]$, let $T_S \subseteq [p]$ be the indices of the columns of $\Omega$ that have at least one non-zero entry in some row indexed by $S$.
(a) For every set $S$ of size 2,
$$P\left(|T_S| \le (4/3)\theta p\right) \le \exp\left(-\frac{\theta p}{108}\right).$$
(b) For every set $S$ of size $\sigma$ with $3 \le \sigma \le 1/\theta$,
$$P\left(|T_S| \le (3\sigma/8)\theta p\right) \le \exp\left(-\frac{\sigma\theta p}{64}\right).$$
(c) For every set $S$ of size $\sigma$ with $1/\theta \le \sigma$,
$$P\left(|T_S| \le (1 - 1/e)p/2\right) \le \exp\left(-\frac{(1 - 1/e)p}{8}\right).$$

7.2. Generalized Lemmas. We will use a result of [14].

Lemma 7.3. Let $\xi_1, \ldots, \xi_n$ be independent centered random variables with variances at least 1 and fourth moments bounded by $B$. Then there exists $\nu \in (0, 1)$ depending only on $B$, such that for every coefficient vector $a = (a_1, \ldots, a_n) \in S^{n-1}$ the random sum $S = \sum_{k=1}^{n} a_k \xi_k$ satisfies
$$P(|S| < 1/2) \le \nu.$$

Definition 7.4. We call a vector $\alpha \in \mathbb{R}^n$ fully dense if $\alpha_i \neq 0$ for all $i \in [n]$.

Lemma 7.5. For $b > s$, let $H \in \mathbb{R}^{s \times b}$ be a matrix with at least one nonzero in each column. Let $R$ be an $s$-by-$b$ matrix of independent centered random variables with variances at least 1 and bounded fourth moments. Define $U = H \odot R$. Then the probability that the left nullspace of $U$ contains a fully dense vector is at most
$$2^{-b + s\log(e^2 b/s)}.$$

Proof of Lemma 7.5. Let $U = [u_1 \ldots u_b]$ denote the columns of $U$ and for each $j \in [b]$, let $N_j$ be the left nullspace of $[u_1 \ldots u_j]$. We show that with high probability $N_b$ cannot contain a fully dense vector. This can be done by showing that if $N_{j-1}$ contains a fully dense vector, then with probability at least 1/2 the dimension of $N_j$ is less than the dimension of $N_{j-1}$. Formally, consider a fully dense vector $\alpha \in N_{j-1}$. If $u_j$ contains only one nonzero entry, then $\alpha^T u_j \neq 0$, reducing the dimension of $N_j$. If $u_j$ contains more than one non-zero entry, then Lemma 7.3 implies that the probability, over the choice of entries of $R_j$, that $\alpha^T u_j = 0$ is less than 1/2. Note that the dimension cannot decrease more than $s$ times. For $N_b$ to contain a fully dense vector, there must be at least $b - s$ columns for which the dimension of the nullspace does not decrease.
Let $F \subseteq [b]$ have size $b - s$. The probability that for every $j \in F$, $N_{j-1}$ contains a fully dense vector and that the dimension of $N_j$ equals the dimension of $N_{j-1}$ is at most $2^{-b+s}$. By the union bound, the probability that $N_b$ contains a fully dense vector is at most
$$\binom{b}{b-s} 2^{-b+s} \le \left(\frac{eb}{s}\right)^{s} 2^{-b+s} \le 2^{-b + s\log(e^2 b/s)}.$$

The proofs of the following lemmas are identical to those in [15] except that they now use our more general Lemma 7.5 along with the lemmas in the previous section.

Lemma 7.6. For $t > 200s$, let $\Omega \in \{0, 1\}^{s \times t}$ be any binary matrix with at least one nonzero in each column. Let $R \in \mathbb{R}^{s \times t}$ be a random matrix whose entries are iid random variables, with $P(R_{ij} = 0) = 0$, and let $U = \Omega \odot R$. Then, the probability that there exists a fully dense vector $\alpha$ for which $\|\alpha^T U\|_0 \le t/5$ is at most $2^{-t/25}$.

Lemma 7.7. If $X = \Omega \odot R$ follows the Bernoulli-Subgaussian model with $P(R_{ij} = 0) = 0$, $1/n < \theta < 1/C$ and $p > Cn \log n$, then the probability that there is a vector $\alpha$ with support of size larger than 1 for which $\|\alpha^T X\|_0 \le (11/9)\theta p$ is at most $\exp(-c\theta p)$, where $C, c$ are numerical constants.

7.3. Proof of Theorem 1.7. Say $Y$ can be decomposed as $A'X'$. From Lemma 7.7, we know that with probability at least $1 - \exp(-c\theta p)$, any linear combination of two or more rows of $X$ has at least $(11/9)\theta p$ nonzeros. By a simple Chernoff bound, the probability that any row of $X$ has more than $(10/9)\theta p$ nonzero entries is bounded by $n\exp(-\theta p/243)$. Thus, the rows of $X$ are likely the sparsest in $\mathrm{row}(X)$. On the previous event of probability at least $1 - \exp(-c\theta p)$, $X$ does not have any left null vectors with more than one nonzero entry. Therefore, if the rows of $X$ are nonzero, $X$ will have no nonzero vectors in its left nullspace. The probability that all of the rows of $X$ are nonzero is at least $1 - n(1 - \theta)^p \ge 1 - n\exp(-cp)$. From this, by Lemma 7.1, we get $\mathrm{row}(X) = \mathrm{row}(Y) = \mathrm{row}(X')$. Hence, we can conclude that every row in $X'$ is a scalar multiple of a row of $X$.

8. Numerical Simulations

We demonstrate that the efficiency of the ER-SpUD algorithm is not improved with larger $p$ values beyond the threshold conjectured. In Figure 2, we have chosen $A$ to be an $n \times n$ matrix of independent $N(0, 1)$ random variables. The $n \times p$ matrix $X$ has $k$ randomly chosen non-zero entries which are Rademacher. The graph on the left of Figure 2 is generated with $p = 5n \log n$ and the one on the right with $p = 5n^2 \log^2 n$. For both graphs, $n$ varies from 10 to 60 and $k$ from 1 to 10. Accuracy is measured in terms of relative error:
$$\mathrm{re}(\hat{A}, A) = \min_{\Pi, \Lambda} \|\hat{A}\Lambda\Pi - A\|_F / \|A\|_F.$$
The average relative error over ten trials is reported.

Figure 2. Mean relative errors of ER-SpUD with $p = 5n \log n$ (left panel, "ER-SpUD Small Sample") versus $p = 5n^2 \log^2 n$ (right panel, "ER-SpUD Large Sample"). Axes: dictionary size $n$ and sparsity $k$; scale: relative error.

We then ran our Algorithm 4 (the very-sparse algorithm of Section 6) in a sparse regime to compare its performance with that of ER-SpUD (see Figure 3). $A$ was as before, but since our algorithm relies on the appearance of 1-sparse columns in $X$, we cannot fix sparsity as in our first experiments. Rather, we vary the Bernoulli parameter $\theta$ from 0.02 to 0.18, and the $\chi_{ij}$ are Rademacher. One can see the expected phase transition at which point the matrix $X$ is no longer sparse enough for our algorithm. In the regime for which the algorithm was designed, the relative error of our output is on the same order as that of ER-SpUD. Furthermore, our algorithm runs much quicker and has no trouble with inputs of size up to $n = 500$. (The numerical experiments were completed on a MacBook Pro.)

Finally, we compare the outcome of our optimal $p$ value with that of a much larger sample size ($p = O(n^2 \log^2 n)$). We let $n$ range from 10 to 200 and $\theta$ range upward from 0.01. Figure 4 shows that the efficacy of the algorithm is not much improved despite the dramatic increase in $p$. The threshold for failure is identical.

Figure 3. Mean relative errors with varying sparsity $\theta$. Here, $p = 5n \log n$. Panels: "Sparse Algorithm" and "ER-SpUD". Axes: dictionary size $n$ and sparsity $\theta$; scale: relative error.

Figure 4. Mean relative errors of Algorithm 4 with $p = 5n \log n$ (panel "Sparse Alg Small Sample") versus $p = 5n^2 \log^2 n$ (panel "Sparse Alg Large Sample"). Axes: dictionary size $n$ and sparsity $\theta$; scale: relative error.
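The relative error $\mathrm{re}(\hat{A}, A)$ used in these experiments minimizes over a permutation $\Pi$ and a diagonal scaling $\Lambda$. One way to compute it, sketched below, is to note that the best scaling of a column $\hat{a}_i$ against a target column $a_j$ is a one-dimensional least-squares fit, after which the best permutation is an assignment problem. The paper does not say how the minimum was computed, so this is an assumed approach; the function name `relative_error` is hypothetical, the columns of $\hat{A}$ are assumed nonzero, and SciPy's `linear_sum_assignment` is used as the assignment solver.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def relative_error(A_hat, A):
    """min over permutations and column scalings of
    ||A_hat @ Lambda @ Pi - A||_F / ||A||_F (columns of A_hat assumed nonzero)."""
    inner = A_hat.T @ A                      # inner[i, j] = <a_hat_i, a_j>
    norms_hat = np.sum(A_hat ** 2, axis=0)   # ||a_hat_i||^2
    norms = np.sum(A ** 2, axis=0)           # ||a_j||^2
    # cost[i, j]: squared residual of the best scalar multiple of a_hat_i
    # fitted to a_j, i.e. ||a_j||^2 - <a_hat_i, a_j>^2 / ||a_hat_i||^2.
    cost = norms[None, :] - inner ** 2 / norms_hat[:, None]
    cost = np.maximum(cost, 0.0)             # clip tiny negative rounding errors
    rows, cols = linear_sum_assignment(cost) # optimal column matching
    return np.sqrt(cost[rows, cols].sum()) / np.linalg.norm(A)

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))

# A permuted and rescaled copy of A should have relative error close to zero.
perm = rng.permutation(5)
scales = rng.uniform(0.5, 2.0, size=5) * rng.choice([-1.0, 1.0], size=5)
A_hat = A[:, perm] * scales

err_exact = relative_error(A_hat, A)
err_noisy = relative_error(A_hat + 0.1 * rng.standard_normal((5, 5)), A)
print(err_exact, err_noisy)
```

Solving the assignment problem avoids enumerating all $n!$ permutations, which matters for the dictionary sizes reported above.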

References

[1] Michal Aharon, Michael Elad, and Alfred Bruckstein. The K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. Signal Processing, IEEE Transactions on, 54(11).
[2] Michal Aharon, Michael Elad, and Alfred M Bruckstein. On the uniqueness of overcomplete dictionaries, and a practical way to retrieve them. Linear Algebra and its Applications, 416(1):48-67.
[3] Jean Bourgain, Van H Vu, and Philip Matchett Wood. On the singularity probability of discrete random matrices. Journal of Functional Analysis, 258(2).
[4] Kjersti Engan, Sven Ole Aase, and J Hakon Husoy. Method of optimal directions for frame design. In Acoustics, Speech, and Signal Processing, Proceedings., 1999 IEEE International Conference on, volume 5. IEEE.
[5] X Fernique. Regularite de processus gaussien. In Invent Math.
[6] Pando Georgiev, Fabian Theis, and Andrzej Cichocki. Blind source separation and sparse component analysis of overcomplete mixtures. In Acoustics, Speech, and Signal Processing, Proceedings. (ICASSP 04). IEEE International Conference on, volume 5, pages V-493. IEEE.
[7] Lee-Ad Gottlieb and Tyler Neylon. Matrix sparsification and the sparse null space problem. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques. Springer.
[8] Florent Jaillet, Rémi Gribonval, Mark D Plumbley, and Hadi Zayyani. An $\ell_1$ criterion for dictionary learning by subspace identification. In Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on. IEEE.
[9] Kenneth Kreutz-Delgado, Joseph F Murray, Bhaskar D Rao, Kjersti Engan, Te-Won Lee, and Terrence J Sejnowski. Dictionary learning algorithms for sparse representation. Neural Computation, 15(2).
[10] Liangyue Li, Sheng Li, and Yun Fu. Discriminative dictionary learning with low-rank regularization for face recognition. In Automatic Face and Gesture Recognition (FG), IEEE International Conference and Workshops on, pages 1-6. IEEE.
[11] Julien Mairal, Francis Bach, Jean Ponce, and Guillermo Sapiro. Online dictionary learning for sparse coding. In Proceedings of the 26th Annual International Conference on Machine Learning. ACM.
[12] Bruno A Olshausen et al. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583).
[13] Mark D Plumbley. Dictionary learning for $\ell_1$-exact sparse coding. In Independent Component Analysis and Signal Separation. Springer.
[14] Mark Rudelson and Roman Vershynin. The Littlewood-Offord problem and invertibility of random matrices. Advances in Mathematics, 218(2).
[15] Daniel A Spielman, Huan Wang, and John Wright. Exact recovery of sparsely-used dictionaries. In Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence. AAAI Press.
[16] Michel Talagrand. Majorizing measures: the generic chaining. The Annals of Probability.
[17] John Wright, Allen Y Yang, Arvind Ganesh, Shankar S Sastry, and Yi Ma. Robust face recognition via sparse representation. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 31(2).
[18] Qiang Zhang and Baoxin Li. Discriminative K-SVD for dictionary learning in face recognition. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE.
[19] Michael Zibulevsky. Blind source separation with relative Newton method. In Proc. ICA, volume 2003.
[20] Michael Zibulevsky and Barak A Pearlmutter. Blind source separation by sparse decomposition. In AeroSense 2000. International Society for Optics and Photonics, 2000.
