Simple, Efficient and Neural Algorithms for Sparse Coding


1 Simple, Efficient and Neural Algorithms for Sparse Coding Ankur Moitra (MIT) joint work with Sanjeev Arora, Rong Ge and Tengyu Ma

2 B. A. Olshausen, D. J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images, 1996

3 B. A. Olshausen, D. J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images, 1996 break natural images into patches (a collection of vectors)

4 B. A. Olshausen, D. J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images, 1996 break natural images into patches (a collection of vectors): sparse coding

5 B. A. Olshausen, D. J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images, 1996 break natural images into patches (a collection of vectors): sparse coding. Properties: localized, bandpass and oriented

6 B. A. Olshausen, D. J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images, 1996 break natural images into patches (a collection of vectors)

7 B. A. Olshausen, D. J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images, 1996 break natural images into patches (a collection of vectors): singular value decomposition

8 B. A. Olshausen, D. J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images, 1996 break natural images into patches (a collection of vectors): singular value decomposition. Noisy! Difficult to interpret!

9 OUTLINE Are there efficient, neural algorithms for sparse coding with provable guarantees?

10 OUTLINE Are there efficient, neural algorithms for sparse coding with provable guarantees? Part I: The Olshausen-Field Update Rule A Non-convex Formulation Neural Implementation A Generative Model; Prior Work

11 OUTLINE Are there efficient, neural algorithms for sparse coding with provable guarantees? Part I: The Olshausen-Field Update Rule A Non-convex Formulation Neural Implementation A Generative Model; Prior Work Part II: A New Update Rule Online, Local and Hebbian with Provable Guarantees Connections to Approximate Gradient Descent Further Extensions

12 More generally, many types of data are sparse in an appropriately chosen basis: data b^(i) (n × p), e.g. images, signals

13 More generally, many types of data are sparse in an appropriately chosen basis: dictionary A (n × m), representations x^(i) (m × p), data b^(i) (n × p), e.g. images, signals

14 More generally, many types of data are sparse in an appropriately chosen basis: dictionary A (n × m), representations x^(i) (m × p) with at most k non-zeros each, data b^(i) (n × p), e.g. images, signals

15 More generally, many types of data are sparse in an appropriately chosen basis: dictionary A (n × m), representations x^(i) (m × p) with at most k non-zeros each, data b^(i) (n × p). Sparse Coding/Dictionary Learning: Can we learn A from examples?

16 NONCONVEX FORMULATIONS Usual approach, minimize reconstruction error: min over A and the x^(i)'s of Σ_{i=1}^{p} ‖b^(i) − A x^(i)‖² + Σ_{i=1}^{p} L(x^(i)), where L is a non-linear penalty function

17 NONCONVEX FORMULATIONS Usual approach, minimize reconstruction error: min over A and the x^(i)'s of Σ_{i=1}^{p} ‖b^(i) − A x^(i)‖² + Σ_{i=1}^{p} L(x^(i)), where L is a non-linear penalty function (encourages sparsity)

18 NONCONVEX FORMULATIONS Usual approach, minimize reconstruction error: min over A and the x^(i)'s of Σ_{i=1}^{p} ‖b^(i) − A x^(i)‖² + Σ_{i=1}^{p} L(x^(i)), where L is a non-linear penalty function (encourages sparsity). This optimization problem is NP-hard and can have many local optima; but heuristics work well nevertheless
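To make the objective concrete, here is a minimal numerical sketch, assuming an L1 penalty as the sparsity-encouraging choice of L and random data purely for illustration:

```python
import numpy as np

def sparse_coding_objective(A, X, B, lam=0.1):
    """Reconstruction error plus an L1 penalty (one common choice of L)."""
    reconstruction = np.sum((B - A @ X) ** 2)   # sum_i ||b^(i) - A x^(i)||^2
    penalty = lam * np.sum(np.abs(X))           # sum_i L(x^(i)) with L = lam * ||.||_1
    return reconstruction + penalty

# toy sizes: n-dimensional data, m dictionary atoms, p samples
n, m, p = 20, 30, 50
rng = np.random.default_rng(0)
B = rng.standard_normal((n, p))          # data, columns are b^(i)
A = rng.standard_normal((n, m))          # dictionary guess
X = rng.standard_normal((m, p)) * 0.1    # representations, columns are x^(i)
print(sparse_coding_objective(A, X, B))
```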

19 A NEURAL IMPLEMENTATION [Olshausen, Field]: a network with an image (stimulus) layer b_j, a residual layer r_j, and an output layer x_i with a nonlinear term L'(x_i); the dictionary is stored as the synapse weights A_{i,j} between the layers

20-24 A NEURAL IMPLEMENTATION [Olshausen, Field]: [the network diagram is built up over several slides: image (stimulus) b_j, residual r_j, synapse weights A_{i,j}, output x_i, nonlinearity L'(x_i)]

25 This network performs gradient descent on: ‖b − Ax‖² + L(x)

26 This network performs gradient descent on: ‖b − Ax‖² + L(x) by alternating between (1) r ← b − Ax (2) x ← x + η(Aᵀr − ∇L(x))

27 This network performs gradient descent on: ‖b − Ax‖² + L(x) by alternating between (1) r ← b − Ax (2) x ← x + η(Aᵀr − ∇L(x)) Moreover A is updated through Hebbian rules

28 This network performs gradient descent on: ‖b − Ax‖² + L(x) by alternating between (1) r ← b − Ax (2) x ← x + η(Aᵀr − ∇L(x)) Moreover A is updated through Hebbian rules There are no provable guarantees, but it works well

29 This network performs gradient descent on: ‖b − Ax‖² + L(x) by alternating between (1) r ← b − Ax (2) x ← x + η(Aᵀr − ∇L(x)) Moreover A is updated through Hebbian rules There are no provable guarantees, but it works well But why should gradient descent on a non-convex function work?

30 This network performs gradient descent on: ‖b − Ax‖² + L(x) by alternating between (1) r ← b − Ax (2) x ← x + η(Aᵀr − ∇L(x)) Moreover A is updated through Hebbian rules There are no provable guarantees, but it works well But why should gradient descent on a non-convex function work? Are simple, local and Hebbian rules sufficient to find globally optimal solutions?
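A minimal sketch of this alternation for a single stimulus, assuming an L1 penalty so that ∇L(x) is taken to be λ·sign(x), and ignoring the dictionary update:

```python
import numpy as np

def infer_code(A, b, eta=0.01, lam=0.1, iters=200):
    """Inner loop of the Olshausen-Field scheme for one stimulus b:
    alternate (1) r <- b - A x and (2) x <- x + eta * (A^T r - grad L(x)),
    here with L(x) = lam * ||x||_1 so grad L(x) = lam * sign(x)."""
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        r = b - A @ x                                    # (1) residual
        x = x + eta * (A.T @ r - lam * np.sign(x))       # (2) gradient step on the code
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((20, 30))
A /= np.linalg.norm(A, axis=0)                           # unit-norm dictionary columns
b = A[:, [2, 7, 11]] @ np.array([1.0, -0.5, 2.0])        # a 3-sparse stimulus
print(np.round(infer_code(A, b), 2))
```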

31 OTHER APPROACHES, AND APPLICATIONS

32 OTHER APPROACHES, AND APPLICATIONS Signal Processing/Statistics (MOD, kSVD): De-noising, edge-detection, super-resolution Block compression for images/video

33 OTHER APPROACHES, AND APPLICATIONS Signal Processing/Statistics (MOD, kSVD): De-noising, edge-detection, super-resolution Block compression for images/video Machine Learning (LBRN06, ...): Sparsity as a regularizer to prevent over-fitting Learned sparse reps. play a key role in deep learning

34 OTHER APPROACHES, AND APPLICATIONS Signal Processing/Statistics (MOD, kSVD): De-noising, edge-detection, super-resolution Block compression for images/video Machine Learning (LBRN06, ...): Sparsity as a regularizer to prevent over-fitting Learned sparse reps. play a key role in deep learning Theoretical Computer Science (SWW13, AGM14, AAJNT14): New algorithms with provable guarantees, in a natural generative model

35 Generative Model: unknown dictionary A; generate x with a support of size k chosen uniformly at random, choose the non-zero values independently, and observe b = Ax

36 Generative Model: unknown dictionary A; generate x with a support of size k chosen uniformly at random, choose the non-zero values independently, and observe b = Ax [Spielman, Wang, Wright 13]: works for full column rank A up to sparsity roughly n^{1/2} (hence m ≤ n)

37 Generative Model: unknown dictionary A; generate x with a support of size k chosen uniformly at random, choose the non-zero values independently, and observe b = Ax [Spielman, Wang, Wright 13]: works for full column rank A up to sparsity roughly n^{1/2} (hence m ≤ n) [Arora, Ge, Moitra 14]: works for overcomplete, μ-incoherent A up to sparsity roughly n^{1/2−ε}/μ

38 Generative Model: unknown dictionary A; generate x with a support of size k chosen uniformly at random, choose the non-zero values independently, and observe b = Ax [Spielman, Wang, Wright 13]: works for full column rank A up to sparsity roughly n^{1/2} (hence m ≤ n) [Arora, Ge, Moitra 14]: works for overcomplete, μ-incoherent A up to sparsity roughly n^{1/2−ε}/μ [Agarwal et al. 14]: works for overcomplete, μ-incoherent A up to sparsity roughly n^{1/4}/μ, via alternating minimization

39 Generative Model: unknown dictionary A; generate x with a support of size k chosen uniformly at random, choose the non-zero values independently, and observe b = Ax [Spielman, Wang, Wright 13]: works for full column rank A up to sparsity roughly n^{1/2} (hence m ≤ n) [Arora, Ge, Moitra 14]: works for overcomplete, μ-incoherent A up to sparsity roughly n^{1/2−ε}/μ [Agarwal et al. 14]: works for overcomplete, μ-incoherent A up to sparsity roughly n^{1/4}/μ, via alternating minimization [Barak, Kelner, Steurer 14]: works for overcomplete A up to sparsity roughly n^{1−ε}, but the running time is exponential in the accuracy
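A minimal sketch of sampling from this generative model, with ±1 non-zero values as one simple choice of coefficient distribution, purely for illustration:

```python
import numpy as np

def sample_from_model(A, k, rng):
    """Generative model: pick a support of size k uniformly at random,
    choose the non-zero values independently, and observe b = A x."""
    m = A.shape[1]
    support = rng.choice(m, size=k, replace=False)
    x = np.zeros(m)
    x[support] = rng.choice([-1.0, 1.0], size=k)   # one simple coefficient distribution
    return A @ x, x

rng = np.random.default_rng(2)
n, m, k = 64, 128, 5
A = rng.standard_normal((n, m))
A /= np.linalg.norm(A, axis=0)        # unit-norm (and typically incoherent) dictionary
b, x = sample_from_model(A, k, rng)
print(np.nonzero(x)[0], b.shape)
```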

40 OUR RESULTS Suppose k ≤ √n/(μ polylog(n)) and ‖A‖ ≤ √(m/n) polylog(n). Suppose we are given Ã that is column-wise δ-close to A, for δ ≤ 1/polylog(n)

41 OUR RESULTS Suppose k ≤ √n/(μ polylog(n)) and ‖A‖ ≤ √(m/n) polylog(n). Suppose we are given Ã that is column-wise δ-close to A, for δ ≤ 1/polylog(n) Theorem [Arora, Ge, Ma, Moitra 14]: There is a variant of the OF-update rule that converges to the true dictionary at a geometric rate, and uses a polynomial number of samples

42 OUR RESULTS Suppose k ≤ √n/(μ polylog(n)) and ‖A‖ ≤ √(m/n) polylog(n). Suppose we are given Ã that is column-wise δ-close to A, for δ ≤ 1/polylog(n) Theorem [Arora, Ge, Ma, Moitra 14]: There is a variant of the OF-update rule that converges to the true dictionary at a geometric rate, and uses a polynomial number of samples All previous algorithms had suboptimal sparsity, worked in less generality, or were exponential in a natural parameter

43 OUR RESULTS Suppose k ≤ √n/(μ polylog(n)) and ‖A‖ ≤ √(m/n) polylog(n). Suppose we are given Ã that is column-wise δ-close to A, for δ ≤ 1/polylog(n) Theorem [Arora, Ge, Ma, Moitra 14]: There is a variant of the OF-update rule that converges to the true dictionary at a geometric rate, and uses a polynomial number of samples All previous algorithms had suboptimal sparsity, worked in less generality, or were exponential in a natural parameter Note: k ≤ √n/(2μ) is a barrier, even for sparse recovery

44 OUR RESULTS Suppose k ≤ √n/(μ polylog(n)) and ‖A‖ ≤ √(m/n) polylog(n). Suppose we are given Ã that is column-wise δ-close to A, for δ ≤ 1/polylog(n) Theorem [Arora, Ge, Ma, Moitra 14]: There is a variant of the OF-update rule that converges to the true dictionary at a geometric rate, and uses a polynomial number of samples All previous algorithms had suboptimal sparsity, worked in less generality, or were exponential in a natural parameter Note: k ≤ √n/(2μ) is a barrier, even for sparse recovery, i.e. if k > √n/(2μ), then x is not necessarily the sparsest solution to Ax = b

45 OUTLINE Are there efficient, neural algorithms for sparse coding with provable guarantees? Part I: The Olshausen-Field Update Rule A Non-convex Formulation Neural Implementation A Generative Model; Prior Work Part II: A New Update Rule Online, Local and Hebbian with Provable Guarantees Connections to Approximate Gradient Descent Further Extensions

46 A NEW UPDATE RULE Alternate between the following steps (size-q batches): (1) x^(i) = threshold(Ãᵀ b^(i)) (2) Ã ← Ã + η Σ_{i=1}^{q} (b^(i) − Ã x^(i)) sgn(x^(i))ᵀ

47 A NEW UPDATE RULE Alternate between the following steps (size-q batches): (1) x^(i) = threshold(Ãᵀ b^(i)) (zero out small entries) (2) Ã ← Ã + η Σ_{i=1}^{q} (b^(i) − Ã x^(i)) sgn(x^(i))ᵀ

48 A NEW UPDATE RULE Alternate between the following steps (size-q batches): (1) x^(i) = threshold(Ãᵀ b^(i)) (2) Ã ← Ã + η Σ_{i=1}^{q} (b^(i) − Ã x^(i)) sgn(x^(i))ᵀ

49 A NEW UPDATE RULE Alternate between the following steps (size-q batches): (1) x^(i) = threshold(Ãᵀ b^(i)) (2) Ã ← Ã + η Σ_{i=1}^{q} (b^(i) − Ã x^(i)) sgn(x^(i))ᵀ The samples arrive online

50 A NEW UPDATE RULE Alternate between the following steps (size-q batches): (1) x^(i) = threshold(Ãᵀ b^(i)) (2) Ã ← Ã + η Σ_{i=1}^{q} (b^(i) − Ã x^(i)) sgn(x^(i))ᵀ The samples arrive online In contrast, previous (provable) algorithms might need to compute a new estimate from scratch when new samples arrive

51 A NEW UPDATE RULE Alternate between the following steps (size-q batches): (1) x^(i) = threshold(Ãᵀ b^(i)) (2) Ã ← Ã + η Σ_{i=1}^{q} (b^(i) − Ã x^(i)) sgn(x^(i))ᵀ

52 A NEW UPDATE RULE Alternate between the following steps (size-q batches): (1) x^(i) = threshold(Ãᵀ b^(i)) (2) Ã ← Ã + η Σ_{i=1}^{q} (b^(i) − Ã x^(i)) sgn(x^(i))ᵀ The computation is local

53 A NEW UPDATE RULE Alternate between the following steps (size-q batches): (1) x^(i) = threshold(Ãᵀ b^(i)) (2) Ã ← Ã + η Σ_{i=1}^{q} (b^(i) − Ã x^(i)) sgn(x^(i))ᵀ The computation is local In particular, the output is a thresholded, weighted sum of activations

54 A NEW UPDATE RULE Alternate between the following steps (size-q batches): (1) x^(i) = threshold(Ãᵀ b^(i)) (2) Ã ← Ã + η Σ_{i=1}^{q} (b^(i) − Ã x^(i)) sgn(x^(i))ᵀ

55 A NEW UPDATE RULE Alternate between the following steps (size-q batches): (1) x^(i) = threshold(Ãᵀ b^(i)) (2) Ã ← Ã + η Σ_{i=1}^{q} (b^(i) − Ã x^(i)) sgn(x^(i))ᵀ The update rule is explicitly Hebbian

56 A NEW UPDATE RULE Alternate between the following steps (size-q batches): (1) x^(i) = threshold(Ãᵀ b^(i)) (2) Ã ← Ã + η Σ_{i=1}^{q} (b^(i) − Ã x^(i)) sgn(x^(i))ᵀ The update rule is explicitly Hebbian: neurons that fire together, wire together

57 A NEW UPDATE RULE Alternate between the following steps (size-q batches): (1) x^(i) = threshold(Ãᵀ b^(i)) (2) Ã ← Ã + η Σ_{i=1}^{q} (b^(i) − Ã x^(i)) sgn(x^(i))ᵀ The update rule is explicitly Hebbian

58 A NEW UPDATE RULE Alternate between the following steps (size-q batches): (1) x^(i) = threshold(Ãᵀ b^(i)) (2) Ã ← Ã + η Σ_{i=1}^{q} (b^(i) − Ã x^(i)) sgn(x^(i))ᵀ The update rule is explicitly Hebbian The update to a weight Ã_{i,j} is the product of the activations at the residual layer and the decoding layer
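A minimal sketch of the full loop, run on synthetic data from the generative model above; the threshold, step size and problem sizes are illustrative choices, not the constants from the analysis:

```python
import numpy as np

def decode(A_hat, B, tau=0.5):
    """Step (1): x^(i) = threshold(A_hat^T b^(i)) -- zero out small entries."""
    X = A_hat.T @ B
    X[np.abs(X) < tau] = 0.0
    return X

def hebbian_update(A_hat, B, eta=0.05):
    """Step (2): A_hat <- A_hat + eta * sum_i (b^(i) - A_hat x^(i)) sgn(x^(i))^T."""
    X = decode(A_hat, B)
    return A_hat + eta * (B - A_hat @ X) @ np.sign(X).T

rng = np.random.default_rng(3)
n, m, k, q = 256, 512, 3, 500
A = rng.standard_normal((n, m)); A /= np.linalg.norm(A, axis=0)   # ground truth
A_hat = A + 0.02 * rng.standard_normal((n, m))                    # column-wise close start
A_hat /= np.linalg.norm(A_hat, axis=0)
col_err = lambda M: np.linalg.norm(M - A, axis=0).max()
print("before:", col_err(A_hat))
for _ in range(20):                               # 20 online batches of size q
    X_true = np.zeros((m, q))
    for i in range(q):
        S = rng.choice(m, size=k, replace=False)
        X_true[S, i] = rng.choice([-1.0, 1.0], size=k)
    A_hat = hebbian_update(A_hat, A @ X_true)     # observe b^(i) = A x^(i), then update
print("after: ", col_err(A_hat))
```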

59 WHAT IS NEURALLY PLAUSIBLE, ANYWAYS? Our update rule (essentially) inherits a neural implementation from [Olshausen, Field]

60 WHAT IS NEURALLY PLAUSIBLE, ANYWAYS? Our update rule (essentially) inherits a neural implementation from [Olshausen, Field] However there are many competing theories for what constitutes a plausible neural implementation, e.g. nonnegative outputs, no bidirectional links, etc.

61 WHAT IS NEURALLY PLAUSIBLE, ANYWAYS? Our update rule (essentially) inherits a neural implementation from [Olshausen, Field] However there are many competing theories for what constitutes a plausible neural implementation, e.g. nonnegative outputs, no bidirectional links, etc. But ours is online, local and Hebbian, all of which are basic properties to require. The surprise is that such simple building blocks can find globally optimal solutions to highly non-trivial algorithmic problems!

62 APPROXIMATE GRADIENT DESCENT We give a general framework for designing and analyzing iterative algorithms for sparse coding

63 APPROXIMATE GRADIENT DESCENT We give a general framework for designing and analyzing iterative algorithms for sparse coding The usual approach is to think of them as trying to minimize a non-convex function: min over A and column-sparse X of E(A, X) = ‖B − AX‖²_F

64 APPROXIMATE GRADIENT DESCENT We give a general framework for designing and analyzing iterative algorithms for sparse coding The usual approach is to think of them as trying to minimize a non-convex function: min over A and column-sparse X of E(A, X) = ‖B − AX‖²_F (the columns of B are the b^(i)'s, the columns of X are the x^(i)'s)

65 APPROXIMATE GRADIENT DESCENT We give a general framework for designing and analyzing iterative algorithms for sparse coding How about thinking of them as trying to minimize an unknown, convex function?

66 APPROXIMATE GRADIENT DESCENT We give a general framework for designing and analyzing iterative algorithms for sparse coding How about thinking of them as trying to minimize an unknown, convex function? min over A of E(A, X) = ‖B − AX‖²_F, where X is held fixed at the (unknown) true representations

67 APPROXIMATE GRADIENT DESCENT We give a general framework for designing and analyzing iterative algorithms for sparse coding How about thinking of them as trying to minimize an unknown, convex function? min over A of E(A, X) = ‖B − AX‖²_F, where X is held fixed at the (unknown) true representations Now the function is strongly convex, and has a global optimum that can be reached by gradient descent!

68 APPROXIMATE GRADIENT DESCENT We give a general framework for designing and analyzing iterative algorithms for sparse coding How about thinking of them as trying to minimize an unknown, convex function? min over A of E(A, X) = ‖B − AX‖²_F, where X is held fixed at the (unknown) true representations Now the function is strongly convex, and has a global optimum that can be reached by gradient descent! New Goal: Prove that (with high probability) step (2) approximates the gradient of this function
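For reference, the gradient of this unknown function with respect to A (a sketch, using the column notation from slide 64):

```latex
\nabla_A \,\|B - AX\|_F^2 \;=\; -2\,(B - AX)\,X^{\top}
\;=\; -2\sum_{i=1}^{p} \bigl(b^{(i)} - A x^{(i)}\bigr)\bigl(x^{(i)}\bigr)^{\top}
```

So an exact gradient step on A would add a multiple of Σ_i (b^(i) − A x^(i)) (x^(i))ᵀ; step (2) of the update rule has the same form with the unknown x^(i) replaced by the decoded sgn(x^(i)), which is why it suffices to show the update is correlated with the true gradient rather than equal to it.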

69 CONDITIONS FOR CONVERGENCE

70 CONDITIONS FOR CONVERGENCE Consider the following general setup: optimal solution: z*, update: z_{s+1} = z_s − η g_s

71 CONDITIONS FOR CONVERGENCE Consider the following general setup: optimal solution: z*, update: z_{s+1} = z_s − η g_s Definition: g_s is (α, β, ε_s)-correlated with z* if for all s: ⟨g_s, z_s − z*⟩ ≥ α ‖z_s − z*‖² + β ‖g_s‖² − ε_s

72 CONDITIONS FOR CONVERGENCE Consider the following general setup: optimal solution: z*, update: z_{s+1} = z_s − η g_s Definition: g_s is (α, β, ε_s)-correlated with z* if for all s: ⟨g_s, z_s − z*⟩ ≥ α ‖z_s − z*‖² + β ‖g_s‖² − ε_s Theorem: If g_s is (α, β, ε_s)-correlated with z*, then ‖z_s − z*‖² ≤ (1 − 2αη)^s ‖z_0 − z*‖² + (max_s ε_s)/α

73 CONDITIONS FOR CONVERGENCE Consider the following general setup: optimal solution: z*, update: z_{s+1} = z_s − η g_s Definition: g_s is (α, β, ε_s)-correlated with z* if for all s: ⟨g_s, z_s − z*⟩ ≥ α ‖z_s − z*‖² + β ‖g_s‖² − ε_s Theorem: If g_s is (α, β, ε_s)-correlated with z*, then ‖z_s − z*‖² ≤ (1 − 2αη)^s ‖z_0 − z*‖² + (max_s ε_s)/α This follows immediately from the usual proof
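A sketch of that usual proof, assuming a step size 0 < η ≤ 2β so the ‖g_s‖² term can be discarded:

```latex
\|z_{s+1} - z^*\|^2
 = \|z_s - z^*\|^2 - 2\eta\,\langle g_s,\, z_s - z^*\rangle + \eta^2\|g_s\|^2
 \le (1 - 2\alpha\eta)\,\|z_s - z^*\|^2 + \eta(\eta - 2\beta)\|g_s\|^2 + 2\eta\,\varepsilon_s
 \le (1 - 2\alpha\eta)\,\|z_s - z^*\|^2 + 2\eta\,\varepsilon_s
```

Unrolling the recursion and bounding the resulting geometric series gives the stated bound, with additive error at most 2η·(max_s ε_s)/(2αη) = (max_s ε_s)/α.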

74 (1) x^(i) = threshold(Ãᵀ b^(i))

75 (1) x^(i) = threshold(Ãᵀ b^(i)) Decoding Lemma: If Ã is 1/polylog(n)-close to A and ‖Ã − A‖ ≤ 2‖A‖, then decoding recovers the signs correctly (whp)

76 (1) x^(i) = threshold(Ãᵀ b^(i)) Decoding Lemma: If Ã is 1/polylog(n)-close to A and ‖Ã − A‖ ≤ 2‖A‖, then decoding recovers the signs correctly (whp) (2) Ã ← Ã + η Σ_{i=1}^{q} (b^(i) − Ã x^(i)) sgn(x^(i))ᵀ

77 (1) x^(i) = threshold(Ãᵀ b^(i)) Decoding Lemma: If Ã is 1/polylog(n)-close to A and ‖Ã − A‖ ≤ 2‖A‖, then decoding recovers the signs correctly (whp) (2) Ã ← Ã + η Σ_{i=1}^{q} (b^(i) − Ã x^(i)) sgn(x^(i))ᵀ Key Lemma: The expectation of the (column-wise) update rule is Ã_j ← Ã_j + ξ (I − Ã_j Ã_jᵀ) A_j − ξ E_R[Ã_R Ã_Rᵀ] A_j + error, where R = supp(x) \ {j}, if decoding recovers the correct signs

78 (1) x^(i) = threshold(Ãᵀ b^(i)) Decoding Lemma: If Ã is 1/polylog(n)-close to A and ‖Ã − A‖ ≤ 2‖A‖, then decoding recovers the signs correctly (whp) (2) Ã ← Ã + η Σ_{i=1}^{q} (b^(i) − Ã x^(i)) sgn(x^(i))ᵀ Key Lemma: The expectation of the (column-wise) update rule is Ã_j ← Ã_j + ξ (I − Ã_j Ã_jᵀ) A_j (≈ A_j − Ã_j) − ξ E_R[Ã_R Ã_Rᵀ] A_j + error, where R = supp(x) \ {j}, if decoding recovers the correct signs

79 (1) x^(i) = threshold(Ãᵀ b^(i)) Decoding Lemma: If Ã is 1/polylog(n)-close to A and ‖Ã − A‖ ≤ 2‖A‖, then decoding recovers the signs correctly (whp) (2) Ã ← Ã + η Σ_{i=1}^{q} (b^(i) − Ã x^(i)) sgn(x^(i))ᵀ Key Lemma: The expectation of the (column-wise) update rule is Ã_j ← Ã_j + ξ (I − Ã_j Ã_jᵀ) A_j (≈ A_j − Ã_j) − ξ E_R[Ã_R Ã_Rᵀ] A_j (a systemic bias) + error, where R = supp(x) \ {j}, if decoding recovers the correct signs

80 (1) x^(i) = threshold(Ãᵀ b^(i)) Decoding Lemma: If Ã is 1/polylog(n)-close to A and ‖Ã − A‖ ≤ 2‖A‖, then decoding recovers the signs correctly (whp) (2) Ã ← Ã + η Σ_{i=1}^{q} (b^(i) − Ã x^(i)) sgn(x^(i))ᵀ Key Lemma: The expectation of the (column-wise) update rule is Ã_j ← Ã_j + ξ (I − Ã_j Ã_jᵀ) A_j (≈ A_j − Ã_j) − ξ E_R[Ã_R Ã_Rᵀ] A_j (a systemic bias) + error, where R = supp(x) \ {j}, if decoding recovers the correct signs Auxiliary Lemma: ‖Ã − A‖ ≤ 2‖A‖ remains true throughout, if η is small enough and q is large enough

81 Proof: Let ζ denote any vector whose norm is negligible (say, n^(−ω(1))).

82 Proof: Let ζ denote any vector whose norm is negligible (say, n^(−ω(1))). g_j = E[(b − Ãx) sgn(x_j)] is the expected update to Ã_j.

83 Proof: Let ζ denote any vector whose norm is negligible (say, n^(−ω(1))). g_j = E[(b − Ãx) sgn(x_j)] is the expected update to Ã_j. Let 1_F be the indicator of the event that decoding recovers the signs of x.

84 Proof: Let ζ denote any vector whose norm is negligible (say, n^(−ω(1))). g_j = E[(b − Ãx) sgn(x_j)] is the expected update to Ã_j. Let 1_F be the indicator of the event that decoding recovers the signs of x. g_j = E[(b − Ãx) sgn(x_j) 1_F] + E[(b − Ãx) sgn(x_j) 1_{F^c}]

85 Proof: Let ζ denote any vector whose norm is negligible (say, n^(−ω(1))). g_j = E[(b − Ãx) sgn(x_j)] is the expected update to Ã_j. Let 1_F be the indicator of the event that decoding recovers the signs of x. g_j = E[(b − Ãx) sgn(x_j) 1_F] ± ζ

86 Proof: Let ζ denote any vector whose norm is negligible (say, n^(−ω(1))). g_j = E[(b − Ãx) sgn(x_j)] is the expected update to Ã_j. Let 1_F be the indicator of the event that decoding recovers the signs of x. g_j = E[(b − Ãx) sgn(x_j) 1_F] ± ζ

87 Proof: Let ζ denote any vector whose norm is negligible (say, n^(−ω(1))). g_j = E[(b − Ãx) sgn(x_j)] is the expected update to Ã_j. Let 1_F be the indicator of the event that decoding recovers the signs of x. g_j = E[(b − Ã threshold(Ãᵀ b)) sgn(x_j) 1_F] ± ζ

88 Proof: Let ζ denote any vector whose norm is negligible (say, n^(−ω(1))). g_j = E[(b − Ãx) sgn(x_j)] is the expected update to Ã_j. Let 1_F be the indicator of the event that decoding recovers the signs of x. Let S = supp(x). g_j = E[(b − Ã_S Ã_Sᵀ b) sgn(x_j) 1_F] ± ζ

89 Proof: Let ζ denote any vector whose norm is negligible (say, n^(−ω(1))). g_j = E[(b − Ãx) sgn(x_j)] is the expected update to Ã_j. Let 1_F be the indicator of the event that decoding recovers the signs of x. Let S = supp(x). g_j = E[(b − Ã_S Ã_Sᵀ b) sgn(x_j)] − E[(b − Ã_S Ã_Sᵀ b) sgn(x_j) 1_{F^c}] ± ζ

90 Proof: Let ζ denote any vector whose norm is negligible (say, n^(−ω(1))). g_j = E[(b − Ãx) sgn(x_j)] is the expected update to Ã_j. Let 1_F be the indicator of the event that decoding recovers the signs of x. Let S = supp(x). g_j = E[(I − Ã_S Ã_Sᵀ) Ax sgn(x_j)] ± ζ

91 Proof: Let ζ denote any vector whose norm is negligible (say, n^(−ω(1))). g_j = E[(b − Ãx) sgn(x_j)] is the expected update to Ã_j. Let 1_F be the indicator of the event that decoding recovers the signs of x. Let S = supp(x). g_j = E[(I − Ã_S Ã_Sᵀ) Ax sgn(x_j)] ± ζ = E_S E_x[(I − Ã_S Ã_Sᵀ) Ax sgn(x_j) | S] ± ζ

92 Proof: Let ζ denote any vector whose norm is negligible (say, n^(−ω(1))). g_j = E[(b − Ãx) sgn(x_j)] is the expected update to Ã_j. Let 1_F be the indicator of the event that decoding recovers the signs of x. Let S = supp(x). g_j = E[(I − Ã_S Ã_Sᵀ) Ax sgn(x_j)] ± ζ = E_S E_x[(I − Ã_S Ã_Sᵀ) Ax sgn(x_j) | S] ± ζ = p_j E_S[(I − Ã_S Ã_Sᵀ) A_j 1_{j ∈ S}] ± ζ, where p_j = E[x_j sgn(x_j) | j ∈ S].

93 Proof: Let ζ denote any vector whose norm is negligible (say, n^(−ω(1))). g_j = E[(b − Ãx) sgn(x_j)] is the expected update to Ã_j. Let 1_F be the indicator of the event that decoding recovers the signs of x. Let S = supp(x). g_j = E[(I − Ã_S Ã_Sᵀ) Ax sgn(x_j)] ± ζ = E_S E_x[(I − Ã_S Ã_Sᵀ) Ax sgn(x_j) | S] ± ζ = p_j E_S[(I − Ã_S Ã_Sᵀ) A_j 1_{j ∈ S}] ± ζ, where p_j = E[x_j sgn(x_j) | j ∈ S]. Let q_j = Pr[j ∈ S], q_{i,j} = Pr[i, j ∈ S]

94 Proof: Let ζ denote any vector whose norm is negligible (say, n^(−ω(1))). g_j = E[(b − Ãx) sgn(x_j)] is the expected update to Ã_j. Let 1_F be the indicator of the event that decoding recovers the signs of x. Let S = supp(x). g_j = E[(I − Ã_S Ã_Sᵀ) Ax sgn(x_j)] ± ζ = E_S E_x[(I − Ã_S Ã_Sᵀ) Ax sgn(x_j) | S] ± ζ = p_j E_S[(I − Ã_S Ã_Sᵀ) A_j 1_{j ∈ S}] ± ζ, where p_j = E[x_j sgn(x_j) | j ∈ S]. Let q_j = Pr[j ∈ S], q_{i,j} = Pr[i, j ∈ S]; then = p_j q_j (I − Ã_j Ã_jᵀ) A_j − p_j Ã_{−j} diag(q_{i,j}) Ã_{−j}ᵀ A_j ± ζ

95 Proof: Let ζ denote any vector whose norm is negligible (say, n^(−ω(1))). g_j = E[(b − Ãx) sgn(x_j)] is the expected update to Ã_j. Let 1_F be the indicator of the event that decoding recovers the signs of x. Let S = supp(x). g_j = E[(I − Ã_S Ã_Sᵀ) Ax sgn(x_j)] ± ζ = E_S E_x[(I − Ã_S Ã_Sᵀ) Ax sgn(x_j) | S] ± ζ = p_j E_S[(I − Ã_S Ã_Sᵀ) A_j 1_{j ∈ S}] ± ζ, where p_j = E[x_j sgn(x_j) | j ∈ S]. Let q_j = Pr[j ∈ S], q_{i,j} = Pr[i, j ∈ S]; then = p_j q_j (I − Ã_j Ã_jᵀ) A_j − p_j Ã_{−j} diag(q_{i,j}) Ã_{−j}ᵀ A_j ± ζ
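A rough Monte Carlo sanity check of the leading term in the Key Lemma, under idealized parameters (±1 coefficients so p_j = 1, and a u.a.r. support so q_j = k/m); this only illustrates the expectation computed above, it is not the analysis itself:

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, k, j, N = 256, 512, 3, 0, 100_000
A = rng.standard_normal((n, m)); A /= np.linalg.norm(A, axis=0)   # true dictionary
A_hat = A + 0.02 * rng.standard_normal((n, m))                    # close estimate
A_hat /= np.linalg.norm(A_hat, axis=0)

acc = np.zeros(n)
for _ in range(N):
    S = rng.choice(m, size=k, replace=False)
    x = np.zeros(m); x[S] = rng.choice([-1.0, 1.0], size=k)
    b = A @ x
    x_dec = A_hat.T @ b
    x_dec[np.abs(x_dec) < 0.5] = 0.0                 # threshold decoding
    acc += (b - A_hat @ x_dec) * np.sign(x_dec[j])
g_j = acc / N                                        # empirical E[(b - A_hat x) sgn(x_j)]

q_j = k / m                                          # Pr[j in S] under a u.a.r. support
lead = q_j * (np.eye(n) - np.outer(A_hat[:, j], A_hat[:, j])) @ A[:, j]
print(np.linalg.norm(g_j - lead), np.linalg.norm(lead))
```

The printed difference should be small compared to the norm of the leading term, with the gap accounted for by the systemic-bias term and Monte Carlo noise.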

96 AN INITIALIZATION PROCEDURE We give an initialization algorithm that outputs Ã that is column-wise δ-close to A for δ ≤ 1/polylog(n), with ‖Ã − A‖ ≤ 2‖A‖

97 AN INITIALIZATION PROCEDURE We give an initialization algorithm that outputs Ã that is column-wise δ-close to A for δ ≤ 1/polylog(n), with ‖Ã − A‖ ≤ 2‖A‖ Repeat: (1) Choose samples b, b′

98 AN INITIALIZATION PROCEDURE We give an initialization algorithm that outputs Ã that is column-wise δ-close to A for δ ≤ 1/polylog(n), with ‖Ã − A‖ ≤ 2‖A‖ Repeat: (1) Choose samples b, b′ (2) Set M_{b,b′} = (1/q) Σ_{i=1}^{q} (bᵀ b^(i)) (b′ᵀ b^(i)) b^(i) (b^(i))ᵀ

99 AN INITIALIZATION PROCEDURE We give an initialization algorithm that outputs Ã that is column-wise δ-close to A for δ ≤ 1/polylog(n), with ‖Ã − A‖ ≤ 2‖A‖ Repeat: (1) Choose samples b, b′ (2) Set M_{b,b′} = (1/q) Σ_{i=1}^{q} (bᵀ b^(i)) (b′ᵀ b^(i)) b^(i) (b^(i))ᵀ (3) If λ_1(M_{b,b′}) > k/m and λ_2(M_{b,b′}) << k/(m log m), output the top eigenvector

100 AN INITIALIZATION PROCEDURE We give an initialization algorithm that outputs Ã that is column-wise δ-close to A for δ ≤ 1/polylog(n), with ‖Ã − A‖ ≤ 2‖A‖ Repeat: (1) Choose samples b, b′ (2) Set M_{b,b′} = (1/q) Σ_{i=1}^{q} (bᵀ b^(i)) (b′ᵀ b^(i)) b^(i) (b^(i))ᵀ (3) If λ_1(M_{b,b′}) > k/m and λ_2(M_{b,b′}) << k/(m log m), output the top eigenvector Key Lemma: If Ax = b and Ax′ = b′, then condition (3) is satisfied if and only if supp(x) ∩ supp(x′) = {j}

101 AN INITIALIZATION PROCEDURE We give an initialization algorithm that outputs Ã that is column-wise δ-close to A for δ ≤ 1/polylog(n), with ‖Ã − A‖ ≤ 2‖A‖ Repeat: (1) Choose samples b, b′ (2) Set M_{b,b′} = (1/q) Σ_{i=1}^{q} (bᵀ b^(i)) (b′ᵀ b^(i)) b^(i) (b^(i))ᵀ (3) If λ_1(M_{b,b′}) > k/m and λ_2(M_{b,b′}) << k/(m log m), output the top eigenvector Key Lemma: If Ax = b and Ax′ = b′, then condition (3) is satisfied if and only if supp(x) ∩ supp(x′) = {j}, in which case the top eigenvector is δ-close to A_j
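A minimal sketch of one trial of this procedure; the constants in the spectral-gap test below are illustrative stand-ins for the thresholds k/m and k/(m log m) on the slide:

```python
import numpy as np

def init_trial(samples, b, b_prime, k, m):
    """One trial of the initialization: build M_{b,b'} from a batch of samples
    (rows of `samples` are the b^(i)) and test for a dominant eigenvalue."""
    q = samples.shape[0]
    weights = (samples @ b) * (samples @ b_prime)         # (b^T b^(i)) (b'^T b^(i))
    M = (samples * weights[:, None]).T @ samples / q      # (1/q) sum_i w_i b^(i) b^(i)^T
    eigvals, eigvecs = np.linalg.eigh(M)
    order = np.argsort(np.abs(eigvals))[::-1]
    lam1, lam2 = abs(eigvals[order[0]]), abs(eigvals[order[1]])
    # illustrative spectral-gap test standing in for lam1 > k/m, lam2 << k/(m log m)
    if lam1 > 0.5 * k / m and lam2 < 0.1 * k / (m * np.log(m)):
        return eigvecs[:, order[0]]
    return None
```

Repeating over many pairs (b, b′) and de-duplicating the returned candidates yields the initial estimate Ã; by the Key Lemma, a trial passes the test exactly when supp(x) ∩ supp(x′) = {j}, and then the returned top eigenvector is δ-close to A_j.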

102 DISCUSSION Our initialization gets us to δ ≤ 1/polylog(n), and can be neurally implemented with Oja's rule

103 DISCUSSION Our initialization gets us to δ ≤ 1/polylog(n), and can be neurally implemented with Oja's rule Earlier analyses of alternating minimization worked for δ ≤ 1/poly(n), in [Arora, Ge, Moitra 14] and [Agarwal et al. 14]

104 DISCUSSION Our initialization gets us to δ ≤ 1/polylog(n), and can be neurally implemented with Oja's rule Earlier analyses of alternating minimization worked for δ ≤ 1/poly(n), in [Arora, Ge, Moitra 14] and [Agarwal et al. 14] However, in those settings Ã and A are so close that the objective function is essentially convex

105 DISCUSSION Our initialization gets us to δ ≤ 1/polylog(n), and can be neurally implemented with Oja's rule Earlier analyses of alternating minimization worked for δ ≤ 1/poly(n), in [Arora, Ge, Moitra 14] and [Agarwal et al. 14] However, in those settings Ã and A are so close that the objective function is essentially convex We show that it converges even from mild starting conditions

106 DISCUSSION Our initialization gets us to δ ≤ 1/polylog(n), and can be neurally implemented with Oja's rule Earlier analyses of alternating minimization worked for δ ≤ 1/poly(n), in [Arora, Ge, Moitra 14] and [Agarwal et al. 14] However, in those settings Ã and A are so close that the objective function is essentially convex We show that it converges even from mild starting conditions As a result, our bounds improve on existing algorithms in terms of running time, sample complexity and sparsity (all but SOS)

107 FURTHER RESULTS Adjusting an iterative algorithm can have subtle effects on its behavior

108 FURTHER RESULTS Adjusting an iterative algorithm can have subtle effects on its behavior We can use our framework to systematically design/analyze new update rules

109 FURTHER RESULTS Adjusting an iterative algorithm can have subtle effects on its behavior We can use our framework to systematically design/analyze new update rules E.g. we can remove the systemic bias, by carefully projecting out along the direction being updated

110 FURTHER RESULTS Adjusting an iterative algorithm can have subtle effects on its behavior We can use our framework to systematically design/analyze new update rules E.g. we can remove the systemic bias, by carefully projecting out along the direction being updated (1) x_j^(i) = threshold(C_jᵀ b^(i)), where C_j = [Proj_{Ã_j⊥}(Ã_1), Proj_{Ã_j⊥}(Ã_2), ..., Ã_j, ..., Proj_{Ã_j⊥}(Ã_m)]

111 FURTHER RESULTS Adjusting an iterative algorithm can have subtle effects on its behavior We can use our framework to systematically design/analyze new update rules E.g. we can remove the systemic bias, by carefully projecting out along the direction being updated (1) x_j^(i) = threshold(C_jᵀ b^(i)), where C_j = [Proj_{Ã_j⊥}(Ã_1), Proj_{Ã_j⊥}(Ã_2), ..., Ã_j, ..., Proj_{Ã_j⊥}(Ã_m)] (2) Ã_j ← Ã_j + η Σ_{i=1}^{q} (b^(i) − C_j x_j^(i)) sgn((x_j^(i))_j)
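A minimal sketch of this variant for a single column j; the helper names and the threshold/step-size values are illustrative choices:

```python
import numpy as np

def build_C_j(A_hat, j):
    """C_j = [Proj_{A_j-perp}(A_1), ..., A_j, ..., Proj_{A_j-perp}(A_m)]:
    project the A_j-direction out of every column except the j-th."""
    a_j = A_hat[:, j] / np.linalg.norm(A_hat[:, j])
    C = A_hat - np.outer(a_j, a_j @ A_hat)     # remove the a_j component everywhere
    C[:, j] = A_hat[:, j]                      # ...but keep the j-th column itself
    return C

def unbiased_column_update(A_hat, B, j, eta=0.05, tau=0.5):
    """One batch of the bias-removed rule for column j (a sketch):
    decode with C_j, then nudge the j-th column by the signed residuals."""
    C = build_C_j(A_hat, j)
    X = C.T @ B
    X[np.abs(X) < tau] = 0.0                   # threshold decoding with C_j
    residual = B - C @ X                       # b^(i) - C_j x_j^(i), as columns
    A_hat = A_hat.copy()
    A_hat[:, j] += eta * residual @ np.sign(X[j, :])
    return A_hat
```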

112 Any Questions? Summary: Online, local and Hebbian algorithms for sparse coding that find a globally optimal solution (whp) Introduced a framework for analyzing iterative algorithms by thinking of them as trying to minimize an unknown, convex function The key is working with a generative model Is computational intractability really a barrier to a rigorous theory of neural computation?
