Simple, Efficient and Neural Algorithms for Sparse Coding
1 Simple, Efficient and Neural Algorithms for Sparse Coding Ankur Moitra (MIT) joint work with Sanjeev Arora, Rong Ge and Tengyu Ma
2–8 B. A. Olshausen, D. J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images, 1996. Break natural images into patches (a collection of vectors) and apply sparse coding. Properties of the learned code: localized, bandpass and oriented. By contrast, applying singular value decomposition to the same patches yields components that are noisy and difficult to interpret.
9–11 OUTLINE Are there efficient, neural algorithms for sparse coding with provable guarantees?
Part I: The Olshausen-Field Update Rule: a non-convex formulation; neural implementation; a generative model; prior work.
Part II: A New Update Rule: online, local and Hebbian with provable guarantees; connections to approximate gradient descent; further extensions.
12–15 More generally, many types of data (e.g. images, signals) are sparse in an appropriately chosen basis: each data vector b^(i) satisfies b^(i) = A x^(i), where the data matrix (with columns b^(i)) is n × p, the dictionary A is n × m, and the representations x^(i) (columns of an m × p matrix) each have at most k non-zeros. Sparse Coding / Dictionary Learning: can we learn A from examples?
16–18 NONCONVEX FORMULATIONS Usual approach: minimize the reconstruction error

$$\min_{A,\; x^{(i)}\text{'s}} \ \sum_{i=1}^{p} \big\| b^{(i)} - A x^{(i)} \big\|^2 \;+\; \sum_{i=1}^{p} L\big(x^{(i)}\big),$$

where L is a non-linear penalty function that encourages sparsity. This optimization problem is NP-hard and can have many local optima, but heuristics work well nevertheless.
19–24 A NEURAL IMPLEMENTATION [Olshausen, Field]: the dictionary is stored as synapse weights A_{i,j}; the image (stimulus) b_j drives a residual layer r_j, which in turn drives the output units x_i through the nonlinearity L'(x_i). [Figure: two-layer network diagram, built up over these slides.]
25–30 This network performs gradient descent on

$$\| b - A x \|^2 + L(x)$$

by alternating between (1) r ← b − A x and (2) x ← x + η (A^T r − ∇L(x)). Moreover, A is updated through Hebbian rules. There are no provable guarantees, but the method works well in practice. But why should gradient descent on a non-convex function work? Are simple, local and Hebbian rules sufficient to find globally optimal solutions?
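A minimal numpy sketch of the inference half of this alternation, assuming an ℓ1-style penalty L(x) = λ‖x‖₁ (so ∇L(x) = λ·sgn(x)); the step size, penalty weight, and iteration count are illustrative choices, and the Hebbian update of A is omitted here.

```python
import numpy as np

def of_inference(A, b, eta=0.01, lam=0.1, iters=200):
    """Infer a sparse code x for stimulus b by gradient descent on ||b - A x||^2 + lam*||x||_1."""
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        r = b - A @ x                                # (1) residual layer
        x = x + eta * (A.T @ r - lam * np.sign(x))   # (2) gradient step on the code
    return x
```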
31–34 OTHER APPROACHES, AND APPLICATIONS
Signal Processing / Statistics (MOD, k-SVD): de-noising, edge detection, super-resolution; block compression for images/video.
Machine Learning (LBRN06, ...): sparsity as a regularizer to prevent over-fitting; learned sparse representations play a key role in deep learning.
Theoretical Computer Science (SWW13, AGM14, AAJNT14): new algorithms with provable guarantees, in a natural generative model.
35–39 Generative Model: there is an unknown dictionary A; each sample is generated by choosing the support of x (of size k) uniformly at random, choosing the non-zero values independently, and observing b = Ax.
[Spielman, Wang, Wright '13]: works for full column rank A up to sparsity roughly n^(1/2) (hence m ≤ n)
[Arora, Ge, Moitra '14]: works for overcomplete, μ-incoherent A up to sparsity roughly n^(1/2−ε)/μ
[Agarwal et al. '14]: works for overcomplete, μ-incoherent A up to sparsity roughly n^(1/4)/μ, via alternating minimization
[Barak, Kelner, Steurer '14]: works for overcomplete A up to sparsity roughly n^(1−ε), but the running time is exponential in the accuracy
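For concreteness, a small numpy sketch of sampling from this generative model; the dimensions, sparsity, and the choice of ±1 non-zero values are illustrative assumptions, not requirements from the talk.

```python
import numpy as np

def sample(A, k, rng):
    """Draw one sample b = A x with k-sparse x: support uniform at random, non-zeros independent."""
    n, m = A.shape
    support = rng.choice(m, size=k, replace=False)   # support of size k, u.a.r.
    x = np.zeros(m)
    x[support] = rng.choice([-1.0, 1.0], size=k)     # independent non-zero values (here +/-1)
    return A @ x, x

rng = np.random.default_rng(0)
n, m, k = 64, 128, 5
A = rng.standard_normal((n, m))
A /= np.linalg.norm(A, axis=0)                       # unit-norm columns
b, x = sample(A, k, rng)
```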
40–44 OUR RESULTS Suppose k ≤ √n/(μ polylog(n)) and ‖A‖ is suitably bounded (roughly √(m/n), up to polylog factors). Suppose we are given an estimate Â (we write Â for the current estimate and A for the true dictionary) that is column-wise δ-close to A, for δ ≤ 1/polylog(n).
Theorem [Arora, Ge, Ma, Moitra '14]: There is a variant of the Olshausen-Field update rule that converges to the true dictionary at a geometric rate, and uses a polynomial number of samples.
All previous algorithms had suboptimal sparsity, worked in less generality, or were exponential in a natural parameter.
Note: k ≈ √n/(2μ) is a barrier, even for sparse recovery; i.e. if k > √n/(2μ), then x is not necessarily the sparsest solution to Ax = b.
45 OUTLINE (recap) Are there efficient, neural algorithms for sparse coding with provable guarantees? Part I: The Olshausen-Field Update Rule: a non-convex formulation; neural implementation; a generative model; prior work. Part II: A New Update Rule: online, local and Hebbian with provable guarantees; connections to approximate gradient descent; further extensions.
46–58 A NEW UPDATE RULE Alternate between the following steps (size-q batches):

$$(1)\ \ x^{(i)} = \mathrm{threshold}\big(\hat A^T b^{(i)}\big) \ \ \text{(zero out small entries)} \qquad (2)\ \ \hat A \leftarrow \hat A + \eta \sum_{i=1}^{q} \big(b^{(i)} - \hat A x^{(i)}\big)\,\mathrm{sgn}\big(x^{(i)}\big)^T$$

The samples arrive online; in contrast, previous (provable) algorithms might need to compute a new estimate from scratch when new samples arrive. The computation is local: in particular, the output is a thresholded, weighted sum of activations. And the update rule is explicitly Hebbian ("neurons that fire together, wire together"): the update to a weight Â_{i,j} is the product of the activations at the residual layer and the decoding layer.
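A compact numpy sketch of one batch of this rule; the threshold τ and step size η below are illustrative choices rather than the tuned values from the analysis (in the model, the threshold is chosen so that, given a good enough Â, exactly the support of x survives decoding).

```python
import numpy as np

def decode(A_hat, B, tau=0.5):
    """Step (1): each column of X is threshold(A_hat^T b), zeroing entries below tau in magnitude."""
    X = A_hat.T @ B                     # one column of inner products per sample
    X[np.abs(X) < tau] = 0.0            # keep only the large entries
    return X

def hebbian_update(A_hat, B, eta=0.05, tau=0.5):
    """Step (2): A <- A + eta * sum_i (b^(i) - A x^(i)) sgn(x^(i))^T, over a batch B of shape n x q."""
    X = decode(A_hat, B, tau)
    return A_hat + eta * (B - A_hat @ X) @ np.sign(X).T
```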
59–61 WHAT IS NEURALLY PLAUSIBLE, ANYWAY? Our update rule (essentially) inherits a neural implementation from [Olshausen, Field]. However, there are many competing theories for what constitutes a plausible neural implementation, e.g. non-negative outputs, no bidirectional links, etc. But ours is online, local and Hebbian, all of which are basic properties to require. The surprise is that such simple building blocks can find globally optimal solutions to highly non-trivial algorithmic problems!
62–68 APPROXIMATE GRADIENT DESCENT We give a general framework for designing and analyzing iterative algorithms for sparse coding. The usual approach is to think of them as trying to minimize a non-convex function:

$$\min_{A,\ \text{column-sparse } X} \ E(A, X) = \| B - A X \|_F^2,$$

where the columns of B are the b^(i)'s and the columns of X are the x^(i)'s. How about instead thinking of them as trying to minimize an unknown, convex function

$$\min_{A} \ E(A, X^*) = \| B - A X^* \|_F^2,$$

where X^* collects the true (unknown) representations? Now the function is strongly convex, and has a global optimum that can be reached by gradient descent! New Goal: prove that (with high probability) step (2) approximates the gradient of this function.
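To see why step (2) is a natural surrogate, it helps to write down the gradient of this unknown function (a standard calculation; treating sgn(x^(i)) as a stand-in for the true x*^(i) is exactly what the analysis has to justify):

$$-\tfrac{1}{2}\,\nabla_A \| B - A X^* \|_F^2 \;=\; (B - A X^*)\,(X^*)^T \;=\; \sum_{i} \big(b^{(i)} - A\, x^{*(i)}\big)\big(x^{*(i)}\big)^T,$$

which has the same outer-product form as the batch update in step (2), with the decoded x^(i) in place of x*^(i) and sgn(x^(i)) in place of x*^(i) in the second factor.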
69–73 CONDITIONS FOR CONVERGENCE Consider the following general setup: optimal solution z*, update z^(s+1) = z^s − η g^s.

Definition: g^s is (α, β, ε_s)-correlated with z* if for all s:
$$\langle g^s,\, z^s - z^* \rangle \ \ge\ \alpha \| z^s - z^* \|^2 + \beta \| g^s \|^2 - \varepsilon_s$$

Theorem: If g^s is (α, β, ε_s)-correlated with z* (and the step size η is suitably small), then
$$\| z^s - z^* \|^2 \ \le\ (1 - 2\alpha\eta)^s\, \| z^0 - z^* \|^2 + \frac{\max_s \varepsilon_s}{\alpha}.$$

This follows immediately from the usual proof of convergence for gradient descent.
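For completeness, here is the one-step calculation behind the theorem, assuming additionally that 0 < η ≤ 2β (the step-size condition the usual descent proof needs):

$$\begin{aligned}
\|z^{s+1}-z^*\|^2 &= \|z^s-z^*\|^2 - 2\eta\,\langle g^s,\, z^s-z^*\rangle + \eta^2\|g^s\|^2 \\
&\le (1-2\alpha\eta)\,\|z^s-z^*\|^2 + \eta(\eta-2\beta)\,\|g^s\|^2 + 2\eta\,\varepsilon_s \\
&\le (1-2\alpha\eta)\,\|z^s-z^*\|^2 + 2\eta\,\varepsilon_s.
\end{aligned}$$

Unrolling the recursion gives the stated bound, since the accumulated error is at most 2η max_s ε_s · Σ_t (1−2αη)^t ≤ max_s ε_s / α.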
74–80 Recall the two steps: (1) x^(i) = threshold(Â^T b^(i)), and (2) Â ← Â + η Σ_{i=1}^q (b^(i) − Â x^(i)) sgn(x^(i))^T.

Decoding Lemma: If Â is 1/polylog(n)-close to A and ‖Â − A‖ ≤ 2‖A‖, then decoding recovers the signs of x correctly (whp).

Key Lemma: The expectation of the (column-wise) update rule is
$$\hat A_j \ \leftarrow\ \hat A_j + \xi\,\big(I - \hat A_j \hat A_j^T\big) A_j + \xi\, \mathbb{E}_R\big[\hat A_R \hat A_R^T\big] A_j + \mathrm{error},$$
where R = supp(x) \ {j}, provided decoding recovers the correct signs. The first correction term is approximately A_j − Â_j; the second is a systemic bias.

Auxiliary Lemma: The bound ‖Â − A‖ ≤ 2‖A‖ remains true throughout, if η is small enough and q is large enough.
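A one-line check on the "approximately A_j − Â_j" claim, under the assumption that both columns are unit vectors with ‖Â_j − A_j‖ = δ (so ⟨Â_j, A_j⟩ = 1 − δ²/2):

$$\big(I - \hat A_j \hat A_j^T\big) A_j \;=\; A_j - \langle \hat A_j, A_j\rangle\, \hat A_j \;=\; \big(A_j - \hat A_j\big) + \tfrac{\delta^2}{2}\,\hat A_j,$$

so this term points in the direction A_j − Â_j up to an O(δ²) correction, which is why the update makes progress toward the true column.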
81–95 Proof: Let ζ denote any vector whose norm is negligible (say, n^(−ω(1))). Then g_j = E[(b − Â x) sgn(x_j)] is the expected update to Â_j. Let 1_F be the indicator of the event that decoding recovers the signs of x, and let S = supp(x). Then

$$\begin{aligned}
g_j &= \mathbb{E}\big[(b - \hat A x)\,\mathrm{sgn}(x_j)\,\mathbf{1}_F\big] + \mathbb{E}\big[(b - \hat A x)\,\mathrm{sgn}(x_j)\,\mathbf{1}_{\bar F}\big] \\
&= \mathbb{E}\big[(b - \hat A x)\,\mathrm{sgn}(x_j)\,\mathbf{1}_F\big] \pm \zeta \\
&= \mathbb{E}\big[\big(b - \hat A\,\mathrm{threshold}(\hat A^T b)\big)\,\mathrm{sgn}(x_j)\,\mathbf{1}_F\big] \pm \zeta \\
&= \mathbb{E}\big[\big(b - \hat A_S \hat A_S^T b\big)\,\mathrm{sgn}(x_j)\,\mathbf{1}_F\big] \pm \zeta
= \mathbb{E}\big[\big(b - \hat A_S \hat A_S^T b\big)\,\mathrm{sgn}(x_j)\big] \pm \zeta \\
&= \mathbb{E}\big[\big(I - \hat A_S \hat A_S^T\big) A x\,\mathrm{sgn}(x_j)\big] \pm \zeta
= \mathbb{E}_S\Big[\,\mathbb{E}_x\big[\big(I - \hat A_S \hat A_S^T\big) A x\,\mathrm{sgn}(x_j) \,\big|\, S\big]\Big] \pm \zeta \\
&= p_j\, \mathbb{E}_S\big[\big(I - \hat A_S \hat A_S^T\big) A_j\big] \pm \zeta, \qquad \text{where } p_j = \mathbb{E}[x_j\,\mathrm{sgn}(x_j) \mid j \in S].
\end{aligned}$$

Let q_j = Pr[j ∈ S] and q_{i,j} = Pr[i, j ∈ S]; then

$$g_j \;=\; p_j q_j\,\big(I - \hat A_j \hat A_j^T\big) A_j \;+\; p_j\, \hat A_{-j}\,\mathrm{diag}(q_{i,j})\,\hat A_{-j}^T\, A_j \;\pm\; \zeta.$$

Matching this with the Key Lemma: ξ = p_j q_j, and since E_R[Â_R Â_R^T] = Σ_{i≠j} (q_{i,j}/q_j) Â_i Â_i^T, the second term is exactly the systemic bias ξ E_R[Â_R Â_R^T] A_j.
96–101 AN INITIALIZATION PROCEDURE We give an initialization algorithm that outputs an Â that is column-wise δ-close to A for δ ≤ 1/polylog(n), with ‖Â − A‖ ≤ 2‖A‖.

Repeat: (1) choose a pair of samples b, b'; (2) set
$$M_{b,b'} = \frac{1}{q} \sum_{i=1}^{q} \big(b^T b^{(i)}\big)\big(b'^T b^{(i)}\big)\; b^{(i)} \big(b^{(i)}\big)^T;$$
(3) if λ_1(M_{b,b'}) > k/m and λ_2(M_{b,b'}) ≪ k/(m log m), output the top eigenvector.

Key Lemma: If Ax = b and Ax' = b', then condition (3) is satisfied if and only if supp(x) ∩ supp(x') = {j}, in which case the top eigenvector is δ-close to A_j.
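A rough numpy sketch of one trial of this procedure; the constants c1, c2 in the eigenvalue tests just mirror the k/m and k/(m log m) scalings above and are illustrative, and the step that merges accepted candidates into the columns of Â is omitted.

```python
import numpy as np

def init_trial(B, b, bp, k, m, c1=1.0, c2=0.5):
    """One trial of the pairwise reweighting initialization.
    B: n x q matrix of fresh samples (columns); b, bp: the two fixed samples."""
    weights = (b @ B) * (bp @ B)              # (b^T b^(i)) (b'^T b^(i)) for each sample
    M = (B * weights) @ B.T / B.shape[1]      # M = (1/q) sum_i weights_i * b^(i) b^(i)^T
    eigvals, eigvecs = np.linalg.eigh(M)      # M is symmetric
    order = np.argsort(-np.abs(eigvals))      # sort eigenvalues by magnitude
    lam1, lam2 = abs(eigvals[order[0]]), abs(eigvals[order[1]])
    if lam1 > c1 * k / m and lam2 < c2 * k / (m * np.log(m)):
        return eigvecs[:, order[0]]           # candidate for some column A_j (up to sign)
    return None
```

Repeating over many pairs (b, b') and merging near-duplicate accepted vectors would then assemble the columns of the initial estimate Â.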
102–106 DISCUSSION Our initialization gets us to δ ≤ 1/polylog(n), and can be neurally implemented with Oja's Rule. Earlier analyses of alternating minimization [Arora, Ge, Moitra '14; Agarwal et al. '14] required δ ≤ 1/poly(n). However, in those settings Â and A are so close that the objective function is essentially convex; we show convergence even from much milder starting conditions. As a result, our bounds improve on existing algorithms in terms of running time, sample complexity and sparsity (all but SOS).
107–111 FURTHER RESULTS Adjusting an iterative algorithm can have subtle effects on its behavior. We can use our framework to systematically design and analyze new update rules. E.g. we can remove the systemic bias by carefully projecting out along the direction being updated:

(1) x_j^(i) = threshold(C_j^T b^(i)), where C_j = [Proj_{Â_j⊥}(Â_1), ..., Â_j, ..., Proj_{Â_j⊥}(Â_m)] (every column other than the j-th projected orthogonal to Â_j)

(2) Â_j ← Â_j + η Σ_{i=1}^q (b^(i) − C_j x^(i)) sgn(x_j^(i))
112 Any Questions? Summary: online, local and Hebbian algorithms for sparse coding that find a globally optimal solution (whp). We introduced a framework for analyzing iterative algorithms by thinking of them as trying to minimize an unknown, convex function. The key is working with a generative model. Is computational intractability really a barrier to a rigorous theory of neural computation?