Simple, Efficient and Neural Algorithms for Sparse Coding


1 Simple, Efficient and Neural Algorithms for Sparse Coding Ankur Moitra (MIT) joint work with Sanjeev Arora, Rong Ge and Tengyu Ma

2 B. A. Olshausen, D. J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images, 1996

3 B. A. Olshausen, D. J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images, 1996 break natural images into patches (a collection of vectors)

4 B. A. Olshausen, D. J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images, 1996 break natural images into patches (a collection of vectors): sparse coding

5 B. A. Olshausen, D. J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images, 1996 break natural images into patches (a collection of vectors): sparse coding. Properties: localized, bandpass and oriented

6 B. A. Olshausen, D. J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images, 1996 break natural images into patches (a collection of vectors)

7 B. A. Olshausen, D. J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images, 1996 break natural images into patches (a collection of vectors): singular value decomposition

8 B. A. Olshausen, D. J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images, 1996 break natural images into patches (a collection of vectors): singular value decomposition. Noisy! Difficult to interpret!

9 OUTLINE Are there efficient, neural algorithms for sparse coding with provable guarantees?

10 OUTLINE Are there efficient, neural algorithms for sparse coding with provable guarantees? Part I: The Olshausen-Field Update Rule A Non-convex Formulation Neural Implementation A Generative Model; Prior Work

11 OUTLINE Are there efficient, neural algorithms for sparse coding with provable guarantees? Part I: The Olshausen-Field Update Rule A Non-convex Formulation Neural Implementation A Generative Model; Prior Work Part II: A New Update Rule Online, Local and Hebbian with Provable Guarantees Connections to Approximate Gradient Descent Further Extensions

12 More generally, many types of data are sparse in an appropriately chosen basis: data b^(i) (n × p), e.g. images, signals

13 More generally, many types of data are sparse in an appropriately chosen basis: dictionary A (n × m), representations x^(i) (m × p), data b^(i) (n × p), e.g. images, signals

14 More generally, many types of data are sparse in an appropriately chosen basis: dictionary A (n × m), representations x^(i) (m × p) with at most k non-zeros each, data b^(i) (n × p), e.g. images, signals

15 More generally, many types of data are sparse in an appropriately chosen basis: dictionary A (n × m), representations x^(i) (m × p) with at most k non-zeros each, data b^(i) (n × p). Sparse Coding/Dictionary Learning: Can we learn A from examples?

16 NONCONVEX FORMULATIONS Usual approach, minimize reconstruction error: min over A and the x^(i)'s of Σ_{i=1}^{p} ‖b^(i) − A x^(i)‖² + Σ_{i=1}^{p} L(x^(i)), where L is a non-linear penalty function

17 NONCONVEX FORMULATIONS Usual approach, minimize reconstruction error: min over A and the x^(i)'s of Σ_{i=1}^{p} ‖b^(i) − A x^(i)‖² + Σ_{i=1}^{p} L(x^(i)), where L is a non-linear penalty function (encourages sparsity)

18 NONCONVEX FORMULATIONS Usual approach, minimize reconstruction error: min over A and the x^(i)'s of Σ_{i=1}^{p} ‖b^(i) − A x^(i)‖² + Σ_{i=1}^{p} L(x^(i)), where L is a non-linear penalty function (encourages sparsity). This optimization problem is NP-hard and can have many local optima; but heuristics work well nevertheless
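To make the objective concrete, here is a minimal numerical sketch, assuming an L1 penalty as the sparsity-encouraging choice of L and random data purely for illustration:

```python
import numpy as np

def sparse_coding_objective(A, X, B, lam=0.1):
    """Reconstruction error plus an L1 penalty (one common choice of L)."""
    reconstruction = np.sum((B - A @ X) ** 2)   # sum_i ||b^(i) - A x^(i)||^2
    penalty = lam * np.sum(np.abs(X))           # sum_i L(x^(i)) with L = lam * ||.||_1
    return reconstruction + penalty

# toy sizes: n-dimensional data, m dictionary atoms, p samples
n, m, p = 20, 30, 50
rng = np.random.default_rng(0)
B = rng.standard_normal((n, p))          # data, columns are b^(i)
A = rng.standard_normal((n, m))          # dictionary guess
X = rng.standard_normal((m, p)) * 0.1    # representations, columns are x^(i)
print(sparse_coding_objective(A, X, B))
```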

19 A NEURAL IMPLEMENTATION [Olshausen, Field]: a network with an image (stimulus) layer b_j, a residual layer r_j, and an output layer x_i with a nonlinear term L'(x_i); the dictionary is stored as the synapse weights A_{i,j} between the layers

20-24 A NEURAL IMPLEMENTATION [Olshausen, Field]: [the network diagram is built up over several slides: image (stimulus) b_j, residual r_j, synapse weights A_{i,j}, output x_i, nonlinearity L'(x_i)]

25 This network performs gradient descent on: ‖b − Ax‖² + L(x)

26 This network performs gradient descent on: ‖b − Ax‖² + L(x) by alternating between (1) r ← b − Ax (2) x ← x + η(Aᵀr − ∇L(x))

27 This network performs gradient descent on: ‖b − Ax‖² + L(x) by alternating between (1) r ← b − Ax (2) x ← x + η(Aᵀr − ∇L(x)) Moreover A is updated through Hebbian rules

28 This network performs gradient descent on: ‖b − Ax‖² + L(x) by alternating between (1) r ← b − Ax (2) x ← x + η(Aᵀr − ∇L(x)) Moreover A is updated through Hebbian rules There are no provable guarantees, but it works well

29 This network performs gradient descent on: ‖b − Ax‖² + L(x) by alternating between (1) r ← b − Ax (2) x ← x + η(Aᵀr − ∇L(x)) Moreover A is updated through Hebbian rules There are no provable guarantees, but it works well But why should gradient descent on a non-convex function work?

30 This network performs gradient descent on: ‖b − Ax‖² + L(x) by alternating between (1) r ← b − Ax (2) x ← x + η(Aᵀr − ∇L(x)) Moreover A is updated through Hebbian rules There are no provable guarantees, but it works well But why should gradient descent on a non-convex function work? Are simple, local and Hebbian rules sufficient to find globally optimal solutions?
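A minimal sketch of this alternation for a single stimulus, assuming an L1 penalty so that ∇L(x) is taken to be λ·sign(x), and ignoring the dictionary update:

```python
import numpy as np

def infer_code(A, b, eta=0.01, lam=0.1, iters=200):
    """Inner loop of the Olshausen-Field scheme for one stimulus b:
    alternate (1) r <- b - A x and (2) x <- x + eta * (A^T r - grad L(x)),
    here with L(x) = lam * ||x||_1 so grad L(x) = lam * sign(x)."""
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        r = b - A @ x                                    # (1) residual
        x = x + eta * (A.T @ r - lam * np.sign(x))       # (2) gradient step on the code
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((20, 30))
A /= np.linalg.norm(A, axis=0)                           # unit-norm dictionary columns
b = A[:, [2, 7, 11]] @ np.array([1.0, -0.5, 2.0])        # a 3-sparse stimulus
print(np.round(infer_code(A, b), 2))
```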

31 OTHER APPROACHES, AND APPLICATIONS

32 OTHER APPROACHES, AND APPLICATIONS Signal Processing/Statistics (MOD, kSVD): De-noising, edge-detection, super-resolution Block compression for images/video

33 OTHER APPROACHES, AND APPLICATIONS Signal Processing/Statistics (MOD, kSVD): De-noising, edge-detection, super-resolution Block compression for images/video Machine Learning (LBRN06, ...): Sparsity as a regularizer to prevent over-fitting Learned sparse reps. play a key role in deep learning

34 OTHER APPROACHES, AND APPLICATIONS Signal Processing/Statistics (MOD, kSVD): De-noising, edge-detection, super-resolution Block compression for images/video Machine Learning (LBRN06, ...): Sparsity as a regularizer to prevent over-fitting Learned sparse reps. play a key role in deep learning Theoretical Computer Science (SWW13, AGM14, AAJNT14): New algorithms with provable guarantees, in a natural generative model

35 Generative Model: unknown dictionary A; generate x with a support of size k chosen uniformly at random, choose the non-zero values independently, and observe b = Ax

36 Generative Model: unknown dictionary A; generate x with a support of size k chosen uniformly at random, choose the non-zero values independently, and observe b = Ax [Spielman, Wang, Wright 13]: works for full column rank A up to sparsity roughly n^{1/2} (hence m ≤ n)

37 Generative Model: unknown dictionary A; generate x with a support of size k chosen uniformly at random, choose the non-zero values independently, and observe b = Ax [Spielman, Wang, Wright 13]: works for full column rank A up to sparsity roughly n^{1/2} (hence m ≤ n) [Arora, Ge, Moitra 14]: works for overcomplete, μ-incoherent A up to sparsity roughly n^{1/2−ε}/μ

38 Generative Model: unknown dictionary A; generate x with a support of size k chosen uniformly at random, choose the non-zero values independently, and observe b = Ax [Spielman, Wang, Wright 13]: works for full column rank A up to sparsity roughly n^{1/2} (hence m ≤ n) [Arora, Ge, Moitra 14]: works for overcomplete, μ-incoherent A up to sparsity roughly n^{1/2−ε}/μ [Agarwal et al. 14]: works for overcomplete, μ-incoherent A up to sparsity roughly n^{1/4}/μ, via alternating minimization

39 Generative Model: unknown dictionary A; generate x with a support of size k chosen uniformly at random, choose the non-zero values independently, and observe b = Ax [Spielman, Wang, Wright 13]: works for full column rank A up to sparsity roughly n^{1/2} (hence m ≤ n) [Arora, Ge, Moitra 14]: works for overcomplete, μ-incoherent A up to sparsity roughly n^{1/2−ε}/μ [Agarwal et al. 14]: works for overcomplete, μ-incoherent A up to sparsity roughly n^{1/4}/μ, via alternating minimization [Barak, Kelner, Steurer 14]: works for overcomplete A up to sparsity roughly n^{1−ε}, but the running time is exponential in the accuracy
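A minimal sketch of sampling from this generative model, with ±1 non-zero values as one simple choice of coefficient distribution, purely for illustration:

```python
import numpy as np

def sample_from_model(A, k, rng):
    """Generative model: pick a support of size k uniformly at random,
    choose the non-zero values independently, and observe b = A x."""
    m = A.shape[1]
    support = rng.choice(m, size=k, replace=False)
    x = np.zeros(m)
    x[support] = rng.choice([-1.0, 1.0], size=k)   # one simple coefficient distribution
    return A @ x, x

rng = np.random.default_rng(2)
n, m, k = 64, 128, 5
A = rng.standard_normal((n, m))
A /= np.linalg.norm(A, axis=0)        # unit-norm (and typically incoherent) dictionary
b, x = sample_from_model(A, k, rng)
print(np.nonzero(x)[0], b.shape)
```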

40 OUR RESULTS Suppose k ≤ √n/(μ polylog(n)) and ‖A‖ ≤ √(m/n) polylog(n). Suppose we are given Ã that is column-wise δ-close to A, for δ ≤ 1/polylog(n)

41 OUR RESULTS Suppose k ≤ √n/(μ polylog(n)) and ‖A‖ ≤ √(m/n) polylog(n). Suppose we are given Ã that is column-wise δ-close to A, for δ ≤ 1/polylog(n) Theorem [Arora, Ge, Ma, Moitra 14]: There is a variant of the OF-update rule that converges to the true dictionary at a geometric rate, and uses a polynomial number of samples

42 OUR RESULTS Suppose k ≤ √n/(μ polylog(n)) and ‖A‖ ≤ √(m/n) polylog(n). Suppose we are given Ã that is column-wise δ-close to A, for δ ≤ 1/polylog(n) Theorem [Arora, Ge, Ma, Moitra 14]: There is a variant of the OF-update rule that converges to the true dictionary at a geometric rate, and uses a polynomial number of samples All previous algorithms had suboptimal sparsity, worked in less generality, or were exponential in a natural parameter

43 OUR RESULTS Suppose k ≤ √n/(μ polylog(n)) and ‖A‖ ≤ √(m/n) polylog(n). Suppose we are given Ã that is column-wise δ-close to A, for δ ≤ 1/polylog(n) Theorem [Arora, Ge, Ma, Moitra 14]: There is a variant of the OF-update rule that converges to the true dictionary at a geometric rate, and uses a polynomial number of samples All previous algorithms had suboptimal sparsity, worked in less generality, or were exponential in a natural parameter Note: k ≤ √n/(2μ) is a barrier, even for sparse recovery

44 OUR RESULTS Suppose k ≤ √n/(μ polylog(n)) and ‖A‖ ≤ √(m/n) polylog(n). Suppose we are given Ã that is column-wise δ-close to A, for δ ≤ 1/polylog(n) Theorem [Arora, Ge, Ma, Moitra 14]: There is a variant of the OF-update rule that converges to the true dictionary at a geometric rate, and uses a polynomial number of samples All previous algorithms had suboptimal sparsity, worked in less generality, or were exponential in a natural parameter Note: k ≤ √n/(2μ) is a barrier, even for sparse recovery, i.e. if k > √n/(2μ), then x is not necessarily the sparsest solution to Ax = b

45 OUTLINE Are there efficient, neural algorithms for sparse coding with provable guarantees? Part I: The Olshausen-Field Update Rule A Non-convex Formulation Neural Implementation A Generative Model; Prior Work Part II: A New Update Rule Online, Local and Hebbian with Provable Guarantees Connections to Approximate Gradient Descent Further Extensions

46 A NEW UPDATE RULE Alternate between the following steps (size-q batches): (1) x^(i) = threshold(Ãᵀ b^(i)) (2) Ã ← Ã + η Σ_{i=1}^{q} (b^(i) − Ã x^(i)) sgn(x^(i))ᵀ

47 A NEW UPDATE RULE Alternate between the following steps (size-q batches): (1) x^(i) = threshold(Ãᵀ b^(i)) (zero out small entries) (2) Ã ← Ã + η Σ_{i=1}^{q} (b^(i) − Ã x^(i)) sgn(x^(i))ᵀ

48 A NEW UPDATE RULE Alternate between the following steps (size-q batches): (1) x^(i) = threshold(Ãᵀ b^(i)) (2) Ã ← Ã + η Σ_{i=1}^{q} (b^(i) − Ã x^(i)) sgn(x^(i))ᵀ

49 A NEW UPDATE RULE Alternate between the following steps (size-q batches): (1) x^(i) = threshold(Ãᵀ b^(i)) (2) Ã ← Ã + η Σ_{i=1}^{q} (b^(i) − Ã x^(i)) sgn(x^(i))ᵀ The samples arrive online

50 A NEW UPDATE RULE Alternate between the following steps (size-q batches): (1) x^(i) = threshold(Ãᵀ b^(i)) (2) Ã ← Ã + η Σ_{i=1}^{q} (b^(i) − Ã x^(i)) sgn(x^(i))ᵀ The samples arrive online In contrast, previous (provable) algorithms might need to compute a new estimate from scratch when new samples arrive

51 A NEW UPDATE RULE Alternate between the following steps (size-q batches): (1) x^(i) = threshold(Ãᵀ b^(i)) (2) Ã ← Ã + η Σ_{i=1}^{q} (b^(i) − Ã x^(i)) sgn(x^(i))ᵀ

52 A NEW UPDATE RULE Alternate between the following steps (size-q batches): (1) x^(i) = threshold(Ãᵀ b^(i)) (2) Ã ← Ã + η Σ_{i=1}^{q} (b^(i) − Ã x^(i)) sgn(x^(i))ᵀ The computation is local

53 A NEW UPDATE RULE Alternate between the following steps (size-q batches): (1) x^(i) = threshold(Ãᵀ b^(i)) (2) Ã ← Ã + η Σ_{i=1}^{q} (b^(i) − Ã x^(i)) sgn(x^(i))ᵀ The computation is local In particular, the output is a thresholded, weighted sum of activations

54 A NEW UPDATE RULE Alternate between the following steps (size-q batches): (1) x^(i) = threshold(Ãᵀ b^(i)) (2) Ã ← Ã + η Σ_{i=1}^{q} (b^(i) − Ã x^(i)) sgn(x^(i))ᵀ

55 A NEW UPDATE RULE Alternate between the following steps (size-q batches): (1) x^(i) = threshold(Ãᵀ b^(i)) (2) Ã ← Ã + η Σ_{i=1}^{q} (b^(i) − Ã x^(i)) sgn(x^(i))ᵀ The update rule is explicitly Hebbian

56 A NEW UPDATE RULE Alternate between the following steps (size-q batches): (1) x^(i) = threshold(Ãᵀ b^(i)) (2) Ã ← Ã + η Σ_{i=1}^{q} (b^(i) − Ã x^(i)) sgn(x^(i))ᵀ The update rule is explicitly Hebbian: neurons that fire together, wire together

57 A NEW UPDATE RULE Alternate between the following steps (size-q batches): (1) x^(i) = threshold(Ãᵀ b^(i)) (2) Ã ← Ã + η Σ_{i=1}^{q} (b^(i) − Ã x^(i)) sgn(x^(i))ᵀ The update rule is explicitly Hebbian

58 A NEW UPDATE RULE Alternate between the following steps (size-q batches): (1) x^(i) = threshold(Ãᵀ b^(i)) (2) Ã ← Ã + η Σ_{i=1}^{q} (b^(i) − Ã x^(i)) sgn(x^(i))ᵀ The update rule is explicitly Hebbian The update to a weight Ã_{i,j} is the product of the activations at the residual layer and the decoding layer
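A minimal sketch of the full loop, run on synthetic data from the generative model above; the threshold, step size and problem sizes are illustrative choices, not the constants from the analysis:

```python
import numpy as np

def decode(A_hat, B, tau=0.5):
    """Step (1): x^(i) = threshold(A_hat^T b^(i)) -- zero out small entries."""
    X = A_hat.T @ B
    X[np.abs(X) < tau] = 0.0
    return X

def hebbian_update(A_hat, B, eta=0.05):
    """Step (2): A_hat <- A_hat + eta * sum_i (b^(i) - A_hat x^(i)) sgn(x^(i))^T."""
    X = decode(A_hat, B)
    return A_hat + eta * (B - A_hat @ X) @ np.sign(X).T

rng = np.random.default_rng(3)
n, m, k, q = 256, 512, 3, 500
A = rng.standard_normal((n, m)); A /= np.linalg.norm(A, axis=0)   # ground truth
A_hat = A + 0.02 * rng.standard_normal((n, m))                    # column-wise close start
A_hat /= np.linalg.norm(A_hat, axis=0)
col_err = lambda M: np.linalg.norm(M - A, axis=0).max()
print("before:", col_err(A_hat))
for _ in range(20):                               # 20 online batches of size q
    X_true = np.zeros((m, q))
    for i in range(q):
        S = rng.choice(m, size=k, replace=False)
        X_true[S, i] = rng.choice([-1.0, 1.0], size=k)
    A_hat = hebbian_update(A_hat, A @ X_true)     # observe b^(i) = A x^(i), then update
print("after: ", col_err(A_hat))
```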

59 WHAT IS NEURALLY PLAUSIBLE, ANYWAYS? Our update rule (essentially) inherits a neural implementation from [Olshausen, Field]

60 WHAT IS NEURALLY PLAUSIBLE, ANYWAYS? Our update rule (essentially) inherits a neural implementation from [Olshausen, Field] However there are many competing theories for what constitutes a plausible neural implementation, e.g. nonnegative outputs, no bidirectional links, etc.

61 WHAT IS NEURALLY PLAUSIBLE, ANYWAYS? Our update rule (essentially) inherits a neural implementation from [Olshausen, Field] However there are many competing theories for what constitutes a plausible neural implementation, e.g. nonnegative outputs, no bidirectional links, etc. But ours is online, local and Hebbian, all of which are basic properties to require. The surprise is that such simple building blocks can find globally optimal solutions to highly non-trivial algorithmic problems!

62 APPROXIMATE GRADIENT DESCENT We give a general framework for designing and analyzing iterative algorithms for sparse coding

63 APPROXIMATE GRADIENT DESCENT We give a general framework for designing and analyzing iterative algorithms for sparse coding The usual approach is to think of them as trying to minimize a non-convex function: min over A and column-sparse X of E(A, X) = ‖B − AX‖²_F

64 APPROXIMATE GRADIENT DESCENT We give a general framework for designing and analyzing iterative algorithms for sparse coding The usual approach is to think of them as trying to minimize a non-convex function: min over A and column-sparse X of E(A, X) = ‖B − AX‖²_F (the columns of B are the b^(i)'s, the columns of X are the x^(i)'s)

65 APPROXIMATE GRADIENT DESCENT We give a general framework for designing and analyzing iterative algorithms for sparse coding How about thinking of them as trying to minimize an unknown, convex function?

66 APPROXIMATE GRADIENT DESCENT We give a general framework for designing and analyzing iterative algorithms for sparse coding How about thinking of them as trying to minimize an unknown, convex function? min over A of E(A, X) = ‖B − AX‖²_F, where X is held fixed at the (unknown) true representations

67 APPROXIMATE GRADIENT DESCENT We give a general framework for designing and analyzing iterative algorithms for sparse coding How about thinking of them as trying to minimize an unknown, convex function? min over A of E(A, X) = ‖B − AX‖²_F, where X is held fixed at the (unknown) true representations Now the function is strongly convex, and has a global optimum that can be reached by gradient descent!

68 APPROXIMATE GRADIENT DESCENT We give a general framework for designing and analyzing iterative algorithms for sparse coding How about thinking of them as trying to minimize an unknown, convex function? min over A of E(A, X) = ‖B − AX‖²_F, where X is held fixed at the (unknown) true representations Now the function is strongly convex, and has a global optimum that can be reached by gradient descent! New Goal: Prove that (with high probability) step (2) approximates the gradient of this function
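For reference, the gradient of this unknown function with respect to A (a sketch, using the column notation from slide 64):

```latex
\nabla_A \,\|B - AX\|_F^2 \;=\; -2\,(B - AX)\,X^{\top}
\;=\; -2\sum_{i=1}^{p} \bigl(b^{(i)} - A x^{(i)}\bigr)\bigl(x^{(i)}\bigr)^{\top}
```

So an exact gradient step on A would add a multiple of Σ_i (b^(i) − A x^(i)) (x^(i))ᵀ; step (2) of the update rule has the same form with the unknown x^(i) replaced by the decoded sgn(x^(i)), which is why it suffices to show the update is correlated with the true gradient rather than equal to it.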

69 CONDITIONS FOR CONVERGENCE

70 CONDITIONS FOR CONVERGENCE Consider the following general setup: optimal solution: z*, update: z_{s+1} = z_s − η g_s

71 CONDITIONS FOR CONVERGENCE Consider the following general setup: optimal solution: z*, update: z_{s+1} = z_s − η g_s Definition: g_s is (α, β, ε_s)-correlated with z* if for all s: ⟨g_s, z_s − z*⟩ ≥ α ‖z_s − z*‖² + β ‖g_s‖² − ε_s

72 CONDITIONS FOR CONVERGENCE Consider the following general setup: optimal solution: z*, update: z_{s+1} = z_s − η g_s Definition: g_s is (α, β, ε_s)-correlated with z* if for all s: ⟨g_s, z_s − z*⟩ ≥ α ‖z_s − z*‖² + β ‖g_s‖² − ε_s Theorem: If g_s is (α, β, ε_s)-correlated with z*, then ‖z_s − z*‖² ≤ (1 − 2αη)^s ‖z_0 − z*‖² + (max_s ε_s)/α

73 CONDITIONS FOR CONVERGENCE Consider the following general setup: optimal solution: z*, update: z_{s+1} = z_s − η g_s Definition: g_s is (α, β, ε_s)-correlated with z* if for all s: ⟨g_s, z_s − z*⟩ ≥ α ‖z_s − z*‖² + β ‖g_s‖² − ε_s Theorem: If g_s is (α, β, ε_s)-correlated with z*, then ‖z_s − z*‖² ≤ (1 − 2αη)^s ‖z_0 − z*‖² + (max_s ε_s)/α This follows immediately from the usual proof
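A sketch of that usual proof, assuming a step size 0 < η ≤ 2β so the ‖g_s‖² term can be discarded:

```latex
\|z_{s+1} - z^*\|^2
 = \|z_s - z^*\|^2 - 2\eta\,\langle g_s,\, z_s - z^*\rangle + \eta^2\|g_s\|^2
 \le (1 - 2\alpha\eta)\,\|z_s - z^*\|^2 + \eta(\eta - 2\beta)\|g_s\|^2 + 2\eta\,\varepsilon_s
 \le (1 - 2\alpha\eta)\,\|z_s - z^*\|^2 + 2\eta\,\varepsilon_s
```

Unrolling the recursion and bounding the resulting geometric series gives the stated bound, with additive error at most 2η·(max_s ε_s)/(2αη) = (max_s ε_s)/α.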

74 (1) x^(i) = threshold(Ãᵀ b^(i))

75 (1) x^(i) = threshold(Ãᵀ b^(i)) Decoding Lemma: If Ã is 1/polylog(n)-close to A and ‖Ã − A‖ ≤ 2‖A‖, then decoding recovers the signs correctly (whp)

76 (1) x^(i) = threshold(Ãᵀ b^(i)) Decoding Lemma: If Ã is 1/polylog(n)-close to A and ‖Ã − A‖ ≤ 2‖A‖, then decoding recovers the signs correctly (whp) (2) Ã ← Ã + η Σ_{i=1}^{q} (b^(i) − Ã x^(i)) sgn(x^(i))ᵀ

77 (1) x^(i) = threshold(Ãᵀ b^(i)) Decoding Lemma: If Ã is 1/polylog(n)-close to A and ‖Ã − A‖ ≤ 2‖A‖, then decoding recovers the signs correctly (whp) (2) Ã ← Ã + η Σ_{i=1}^{q} (b^(i) − Ã x^(i)) sgn(x^(i))ᵀ Key Lemma: The expectation of the (column-wise) update rule is Ã_j ← Ã_j + ξ (I − Ã_j Ã_jᵀ) A_j − ξ E_R[Ã_R Ã_Rᵀ] A_j + error, where R = supp(x) \ {j}, if decoding recovers the correct signs

78 (1) x^(i) = threshold(Ãᵀ b^(i)) Decoding Lemma: If Ã is 1/polylog(n)-close to A and ‖Ã − A‖ ≤ 2‖A‖, then decoding recovers the signs correctly (whp) (2) Ã ← Ã + η Σ_{i=1}^{q} (b^(i) − Ã x^(i)) sgn(x^(i))ᵀ Key Lemma: The expectation of the (column-wise) update rule is Ã_j ← Ã_j + ξ (I − Ã_j Ã_jᵀ) A_j (≈ A_j − Ã_j) − ξ E_R[Ã_R Ã_Rᵀ] A_j + error, where R = supp(x) \ {j}, if decoding recovers the correct signs

79 (1) x^(i) = threshold(Ãᵀ b^(i)) Decoding Lemma: If Ã is 1/polylog(n)-close to A and ‖Ã − A‖ ≤ 2‖A‖, then decoding recovers the signs correctly (whp) (2) Ã ← Ã + η Σ_{i=1}^{q} (b^(i) − Ã x^(i)) sgn(x^(i))ᵀ Key Lemma: The expectation of the (column-wise) update rule is Ã_j ← Ã_j + ξ (I − Ã_j Ã_jᵀ) A_j (≈ A_j − Ã_j) − ξ E_R[Ã_R Ã_Rᵀ] A_j (a systemic bias) + error, where R = supp(x) \ {j}, if decoding recovers the correct signs

80 (1) x^(i) = threshold(Ãᵀ b^(i)) Decoding Lemma: If Ã is 1/polylog(n)-close to A and ‖Ã − A‖ ≤ 2‖A‖, then decoding recovers the signs correctly (whp) (2) Ã ← Ã + η Σ_{i=1}^{q} (b^(i) − Ã x^(i)) sgn(x^(i))ᵀ Key Lemma: The expectation of the (column-wise) update rule is Ã_j ← Ã_j + ξ (I − Ã_j Ã_jᵀ) A_j (≈ A_j − Ã_j) − ξ E_R[Ã_R Ã_Rᵀ] A_j (a systemic bias) + error, where R = supp(x) \ {j}, if decoding recovers the correct signs Auxiliary Lemma: ‖Ã − A‖ ≤ 2‖A‖ remains true throughout, if η is small enough and q is large enough

81 Proof: Let ζ denote any vector whose norm is negligible (say, n^(−ω(1))).

82 Proof: Let ζ denote any vector whose norm is negligible (say, n^(−ω(1))). g_j = E[(b − Ãx) sgn(x_j)] is the expected update to Ã_j.

83 Proof: Let ζ denote any vector whose norm is negligible (say, n^(−ω(1))). g_j = E[(b − Ãx) sgn(x_j)] is the expected update to Ã_j. Let 1_F be the indicator of the event that decoding recovers the signs of x.

84 Proof: Let ζ denote any vector whose norm is negligible (say, n^(−ω(1))). g_j = E[(b − Ãx) sgn(x_j)] is the expected update to Ã_j. Let 1_F be the indicator of the event that decoding recovers the signs of x. g_j = E[(b − Ãx) sgn(x_j) 1_F] + E[(b − Ãx) sgn(x_j) 1_{F^c}]

85 Proof: Let ζ denote any vector whose norm is negligible (say, n^(−ω(1))). g_j = E[(b − Ãx) sgn(x_j)] is the expected update to Ã_j. Let 1_F be the indicator of the event that decoding recovers the signs of x. g_j = E[(b − Ãx) sgn(x_j) 1_F] ± ζ

86 Proof: Let ζ denote any vector whose norm is negligible (say, n^(−ω(1))). g_j = E[(b − Ãx) sgn(x_j)] is the expected update to Ã_j. Let 1_F be the indicator of the event that decoding recovers the signs of x. g_j = E[(b − Ãx) sgn(x_j) 1_F] ± ζ

87 Proof: Let ζ denote any vector whose norm is negligible (say, n^(−ω(1))). g_j = E[(b − Ãx) sgn(x_j)] is the expected update to Ã_j. Let 1_F be the indicator of the event that decoding recovers the signs of x. g_j = E[(b − Ã threshold(Ãᵀ b)) sgn(x_j) 1_F] ± ζ

88 Proof: Let ζ denote any vector whose norm is negligible (say, n^(−ω(1))). g_j = E[(b − Ãx) sgn(x_j)] is the expected update to Ã_j. Let 1_F be the indicator of the event that decoding recovers the signs of x. Let S = supp(x). g_j = E[(b − Ã_S Ã_Sᵀ b) sgn(x_j) 1_F] ± ζ

89 Proof: Let ζ denote any vector whose norm is negligible (say, n^(−ω(1))). g_j = E[(b − Ãx) sgn(x_j)] is the expected update to Ã_j. Let 1_F be the indicator of the event that decoding recovers the signs of x. Let S = supp(x). g_j = E[(b − Ã_S Ã_Sᵀ b) sgn(x_j)] − E[(b − Ã_S Ã_Sᵀ b) sgn(x_j) 1_{F^c}] ± ζ

90 Proof: Let ζ denote any vector whose norm is negligible (say, n^(−ω(1))). g_j = E[(b − Ãx) sgn(x_j)] is the expected update to Ã_j. Let 1_F be the indicator of the event that decoding recovers the signs of x. Let S = supp(x). g_j = E[(I − Ã_S Ã_Sᵀ) Ax sgn(x_j)] ± ζ

91 Proof: Let ζ denote any vector whose norm is negligible (say, n^(−ω(1))). g_j = E[(b − Ãx) sgn(x_j)] is the expected update to Ã_j. Let 1_F be the indicator of the event that decoding recovers the signs of x. Let S = supp(x). g_j = E[(I − Ã_S Ã_Sᵀ) Ax sgn(x_j)] ± ζ = E_S E_x[(I − Ã_S Ã_Sᵀ) Ax sgn(x_j) | S] ± ζ

92 Proof: Let ζ denote any vector whose norm is negligible (say, n^(−ω(1))). g_j = E[(b − Ãx) sgn(x_j)] is the expected update to Ã_j. Let 1_F be the indicator of the event that decoding recovers the signs of x. Let S = supp(x). g_j = E[(I − Ã_S Ã_Sᵀ) Ax sgn(x_j)] ± ζ = E_S E_x[(I − Ã_S Ã_Sᵀ) Ax sgn(x_j) | S] ± ζ = p_j E_S[(I − Ã_S Ã_Sᵀ) A_j 1_{j ∈ S}] ± ζ, where p_j = E[x_j sgn(x_j) | j ∈ S].

93 Proof: Let ζ denote any vector whose norm is negligible (say, n^(−ω(1))). g_j = E[(b − Ãx) sgn(x_j)] is the expected update to Ã_j. Let 1_F be the indicator of the event that decoding recovers the signs of x. Let S = supp(x). g_j = E[(I − Ã_S Ã_Sᵀ) Ax sgn(x_j)] ± ζ = E_S E_x[(I − Ã_S Ã_Sᵀ) Ax sgn(x_j) | S] ± ζ = p_j E_S[(I − Ã_S Ã_Sᵀ) A_j 1_{j ∈ S}] ± ζ, where p_j = E[x_j sgn(x_j) | j ∈ S]. Let q_j = Pr[j ∈ S], q_{i,j} = Pr[i, j ∈ S]

94 Proof: Let ζ denote any vector whose norm is negligible (say, n^(−ω(1))). g_j = E[(b − Ãx) sgn(x_j)] is the expected update to Ã_j. Let 1_F be the indicator of the event that decoding recovers the signs of x. Let S = supp(x). g_j = E[(I − Ã_S Ã_Sᵀ) Ax sgn(x_j)] ± ζ = E_S E_x[(I − Ã_S Ã_Sᵀ) Ax sgn(x_j) | S] ± ζ = p_j E_S[(I − Ã_S Ã_Sᵀ) A_j 1_{j ∈ S}] ± ζ, where p_j = E[x_j sgn(x_j) | j ∈ S]. Let q_j = Pr[j ∈ S], q_{i,j} = Pr[i, j ∈ S]; then = p_j q_j (I − Ã_j Ã_jᵀ) A_j − p_j Ã_{−j} diag(q_{i,j}) Ã_{−j}ᵀ A_j ± ζ

95 Proof: Let ζ denote any vector whose norm is negligible (say, n^(−ω(1))). g_j = E[(b − Ãx) sgn(x_j)] is the expected update to Ã_j. Let 1_F be the indicator of the event that decoding recovers the signs of x. Let S = supp(x). g_j = E[(I − Ã_S Ã_Sᵀ) Ax sgn(x_j)] ± ζ = E_S E_x[(I − Ã_S Ã_Sᵀ) Ax sgn(x_j) | S] ± ζ = p_j E_S[(I − Ã_S Ã_Sᵀ) A_j 1_{j ∈ S}] ± ζ, where p_j = E[x_j sgn(x_j) | j ∈ S]. Let q_j = Pr[j ∈ S], q_{i,j} = Pr[i, j ∈ S]; then = p_j q_j (I − Ã_j Ã_jᵀ) A_j − p_j Ã_{−j} diag(q_{i,j}) Ã_{−j}ᵀ A_j ± ζ
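A rough Monte Carlo sanity check of the leading term in the Key Lemma, under idealized parameters (±1 coefficients so p_j = 1, and a u.a.r. support so q_j = k/m); this only illustrates the expectation computed above, it is not the analysis itself:

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, k, j, N = 256, 512, 3, 0, 100_000
A = rng.standard_normal((n, m)); A /= np.linalg.norm(A, axis=0)   # true dictionary
A_hat = A + 0.02 * rng.standard_normal((n, m))                    # close estimate
A_hat /= np.linalg.norm(A_hat, axis=0)

acc = np.zeros(n)
for _ in range(N):
    S = rng.choice(m, size=k, replace=False)
    x = np.zeros(m); x[S] = rng.choice([-1.0, 1.0], size=k)
    b = A @ x
    x_dec = A_hat.T @ b
    x_dec[np.abs(x_dec) < 0.5] = 0.0                 # threshold decoding
    acc += (b - A_hat @ x_dec) * np.sign(x_dec[j])
g_j = acc / N                                        # empirical E[(b - A_hat x) sgn(x_j)]

q_j = k / m                                          # Pr[j in S] under a u.a.r. support
lead = q_j * (np.eye(n) - np.outer(A_hat[:, j], A_hat[:, j])) @ A[:, j]
print(np.linalg.norm(g_j - lead), np.linalg.norm(lead))
```

The printed difference should be small compared to the norm of the leading term, with the gap accounted for by the systemic-bias term and Monte Carlo noise.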

96 AN INITIALIZATION PROCEDURE We give an initialization algorithm that outputs Ã that is column-wise δ-close to A for δ ≤ 1/polylog(n), with ‖Ã − A‖ ≤ 2‖A‖

97 AN INITIALIZATION PROCEDURE We give an initialization algorithm that outputs Ã that is column-wise δ-close to A for δ ≤ 1/polylog(n), with ‖Ã − A‖ ≤ 2‖A‖ Repeat: (1) Choose samples b, b′

98 AN INITIALIZATION PROCEDURE We give an initialization algorithm that outputs Ã that is column-wise δ-close to A for δ ≤ 1/polylog(n), with ‖Ã − A‖ ≤ 2‖A‖ Repeat: (1) Choose samples b, b′ (2) Set M_{b,b′} = (1/q) Σ_{i=1}^{q} (bᵀ b^(i)) (b′ᵀ b^(i)) b^(i) (b^(i))ᵀ

99 AN INITIALIZATION PROCEDURE We give an initialization algorithm that outputs Ã that is column-wise δ-close to A for δ ≤ 1/polylog(n), with ‖Ã − A‖ ≤ 2‖A‖ Repeat: (1) Choose samples b, b′ (2) Set M_{b,b′} = (1/q) Σ_{i=1}^{q} (bᵀ b^(i)) (b′ᵀ b^(i)) b^(i) (b^(i))ᵀ (3) If λ_1(M_{b,b′}) > k/m and λ_2(M_{b,b′}) << k/(m log m), output the top eigenvector

100 AN INITIALIZATION PROCEDURE We give an initialization algorithm that outputs Ã that is column-wise δ-close to A for δ ≤ 1/polylog(n), with ‖Ã − A‖ ≤ 2‖A‖ Repeat: (1) Choose samples b, b′ (2) Set M_{b,b′} = (1/q) Σ_{i=1}^{q} (bᵀ b^(i)) (b′ᵀ b^(i)) b^(i) (b^(i))ᵀ (3) If λ_1(M_{b,b′}) > k/m and λ_2(M_{b,b′}) << k/(m log m), output the top eigenvector Key Lemma: If Ax = b and Ax′ = b′, then condition (3) is satisfied if and only if supp(x) ∩ supp(x′) = {j}

101 AN INITIALIZATION PROCEDURE We give an initialization algorithm that outputs Ã that is column-wise δ-close to A for δ ≤ 1/polylog(n), with ‖Ã − A‖ ≤ 2‖A‖ Repeat: (1) Choose samples b, b′ (2) Set M_{b,b′} = (1/q) Σ_{i=1}^{q} (bᵀ b^(i)) (b′ᵀ b^(i)) b^(i) (b^(i))ᵀ (3) If λ_1(M_{b,b′}) > k/m and λ_2(M_{b,b′}) << k/(m log m), output the top eigenvector Key Lemma: If Ax = b and Ax′ = b′, then condition (3) is satisfied if and only if supp(x) ∩ supp(x′) = {j}, in which case the top eigenvector is δ-close to A_j
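A minimal sketch of one trial of this procedure; the constants in the spectral-gap test below are illustrative stand-ins for the thresholds k/m and k/(m log m) on the slide:

```python
import numpy as np

def init_trial(samples, b, b_prime, k, m):
    """One trial of the initialization: build M_{b,b'} from a batch of samples
    (rows of `samples` are the b^(i)) and test for a dominant eigenvalue."""
    q = samples.shape[0]
    weights = (samples @ b) * (samples @ b_prime)         # (b^T b^(i)) (b'^T b^(i))
    M = (samples * weights[:, None]).T @ samples / q      # (1/q) sum_i w_i b^(i) b^(i)^T
    eigvals, eigvecs = np.linalg.eigh(M)
    order = np.argsort(np.abs(eigvals))[::-1]
    lam1, lam2 = abs(eigvals[order[0]]), abs(eigvals[order[1]])
    # illustrative spectral-gap test standing in for lam1 > k/m, lam2 << k/(m log m)
    if lam1 > 0.5 * k / m and lam2 < 0.1 * k / (m * np.log(m)):
        return eigvecs[:, order[0]]
    return None
```

Repeating over many pairs (b, b′) and de-duplicating the returned candidates yields the initial estimate Ã; by the Key Lemma, a trial passes the test exactly when supp(x) ∩ supp(x′) = {j}, and then the returned top eigenvector is δ-close to A_j.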

102 DISCUSSION Our initialization gets us to δ ≤ 1/polylog(n), and can be neurally implemented with Oja's rule

103 DISCUSSION Our initialization gets us to δ ≤ 1/polylog(n), and can be neurally implemented with Oja's rule Earlier analyses of alternating minimization worked for δ ≤ 1/poly(n), in [Arora, Ge, Moitra 14] and [Agarwal et al. 14]

104 DISCUSSION Our initialization gets us to δ ≤ 1/polylog(n), and can be neurally implemented with Oja's rule Earlier analyses of alternating minimization worked for δ ≤ 1/poly(n), in [Arora, Ge, Moitra 14] and [Agarwal et al. 14] However, in those settings Ã and A are so close that the objective function is essentially convex

105 DISCUSSION Our initialization gets us to δ ≤ 1/polylog(n), and can be neurally implemented with Oja's rule Earlier analyses of alternating minimization worked for δ ≤ 1/poly(n), in [Arora, Ge, Moitra 14] and [Agarwal et al. 14] However, in those settings Ã and A are so close that the objective function is essentially convex We show that it converges even from mild starting conditions

106 DISCUSSION Our initialization gets us to δ ≤ 1/polylog(n), and can be neurally implemented with Oja's rule Earlier analyses of alternating minimization worked for δ ≤ 1/poly(n), in [Arora, Ge, Moitra 14] and [Agarwal et al. 14] However, in those settings Ã and A are so close that the objective function is essentially convex We show that it converges even from mild starting conditions As a result, our bounds improve on existing algorithms in terms of running time, sample complexity and sparsity (all but SOS)

107 FURTHER RESULTS Adjusting an iterative algorithm can have subtle effects on its behavior

108 FURTHER RESULTS Adjusting an iterative algorithm can have subtle effects on its behavior We can use our framework to systematically design/analyze new update rules

109 FURTHER RESULTS Adjusting an iterative algorithm can have subtle effects on its behavior We can use our framework to systematically design/analyze new update rules E.g. we can remove the systemic bias, by carefully projecting out along the direction being updated

110 FURTHER RESULTS Adjusting an iterative algorithm can have subtle effects on its behavior We can use our framework to systematically design/analyze new update rules E.g. we can remove the systemic bias, by carefully projecting out along the direction being updated (1) x_j^(i) = threshold(C_jᵀ b^(i)), where C_j = [Proj_{Ã_j⊥}(Ã_1), Proj_{Ã_j⊥}(Ã_2), ..., Ã_j, ..., Proj_{Ã_j⊥}(Ã_m)]

111 FURTHER RESULTS Adjusting an iterative algorithm can have subtle effects on its behavior We can use our framework to systematically design/analyze new update rules E.g. we can remove the systemic bias, by carefully projecting out along the direction being updated (1) x_j^(i) = threshold(C_jᵀ b^(i)), where C_j = [Proj_{Ã_j⊥}(Ã_1), Proj_{Ã_j⊥}(Ã_2), ..., Ã_j, ..., Proj_{Ã_j⊥}(Ã_m)] (2) Ã_j ← Ã_j + η Σ_{i=1}^{q} (b^(i) − C_j x_j^(i)) sgn((x_j^(i))_j)
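A minimal sketch of this variant for a single column j; the helper names and the threshold/step-size values are illustrative choices:

```python
import numpy as np

def build_C_j(A_hat, j):
    """C_j = [Proj_{A_j-perp}(A_1), ..., A_j, ..., Proj_{A_j-perp}(A_m)]:
    project the A_j-direction out of every column except the j-th."""
    a_j = A_hat[:, j] / np.linalg.norm(A_hat[:, j])
    C = A_hat - np.outer(a_j, a_j @ A_hat)     # remove the a_j component everywhere
    C[:, j] = A_hat[:, j]                      # ...but keep the j-th column itself
    return C

def unbiased_column_update(A_hat, B, j, eta=0.05, tau=0.5):
    """One batch of the bias-removed rule for column j (a sketch):
    decode with C_j, then nudge the j-th column by the signed residuals."""
    C = build_C_j(A_hat, j)
    X = C.T @ B
    X[np.abs(X) < tau] = 0.0                   # threshold decoding with C_j
    residual = B - C @ X                       # b^(i) - C_j x_j^(i), as columns
    A_hat = A_hat.copy()
    A_hat[:, j] += eta * residual @ np.sign(X[j, :])
    return A_hat
```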

112 Any Questions? Summary: Online, local and Hebbian algorithms for sparse coding that find a globally optimal solution (whp) Introduced a framework for analyzing iterative algorithms by thinking of them as trying to minimize an unknown, convex function The key is working with a generative model Is computational intractability really a barrier to a rigorous theory of neural computation?
