Online Dictionary Learning with Group Structure Inducing Norms

1 Online Dictionary Learning with Group Structure Inducing Norms. Zoltán Szabó¹, Barnabás Póczos², András Lőrincz¹. ¹Eötvös Loránd University, Budapest, Hungary; ²Carnegie Mellon University, Pittsburgh, USA. ICML, Structured Sparsity, July 2, 2011.

2 Contents Sparse coding, structured sparsity. Structured dictionary learning: our requirements, cost function, special cases, optimization. Numerical examples.

3 Sparse coding Observation ($x$) = linear combination of a few vectors ($\alpha$) from a fixed dictionary ($D$). $\ell_0$-norm solution: NP-hard. Popular relaxations: $\ell_p$ ($0 < p \le 1$) norm. Special case: $\ell_1$, the Lasso problem, for which efficient algorithms exist:
$$\min_{\alpha} \left[ \frac{1}{2} \|x - D\alpha\|_2^2 + \kappa \|\alpha\|_1 \right]. \tag{1}$$
Disadvantage: prior knowledge on the structure of the hidden code is not taken into account.
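To make the Lasso problem (1) concrete, here is a minimal sketch of one standard solver, ISTA (iterative soft thresholding). The slides only say that efficient algorithms exist and do not name one, so the choice of ISTA, the function names, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (elementwise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def lasso_objective(x, D, alpha, kappa):
    """Value of 0.5 * ||x - D alpha||_2^2 + kappa * ||alpha||_1, as in Eq. (1)."""
    return 0.5 * float(np.sum((x - D @ alpha) ** 2)) + kappa * float(np.sum(np.abs(alpha)))

def lasso_ista(x, D, kappa, n_iter=500):
    """Approximately minimize Eq. (1) by ISTA with step size 1/L."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient of the quadratic term
    alpha = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ alpha - x)       # gradient of 0.5 * ||x - D alpha||^2
        alpha = soft_threshold(alpha - grad / L, kappa / L)
    return alpha

# Toy problem (hypothetical data): a 3-sparse code in a random dictionary.
rng = np.random.default_rng(0)
D = rng.standard_normal((20, 50))
alpha_true = np.zeros(50)
alpha_true[[3, 17, 41]] = [1.5, -2.0, 1.0]
x = D @ alpha_true
alpha_hat = lasso_ista(x, D, kappa=0.1)
```

With step size 1/L the objective value never increases from one iteration to the next, which is why the zero initialization is a safe starting point.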

4 Structured sparsity Different kinds of structure (e.g., disjoint groups, trees) imposed on the sparse codes have improved performance in several applications: robust compressed sensing (CS) with substantially fewer observations, multi-task learning problems, structure learning in graphical models, natural language processing, fMRI analysis, facial expression discrimination/recognition.

5 Structured dictionary learning Both dictionary learning (e.g., (sparse) principal component analysis, (sparse) non-negative matrix factorization (NMF), independent component analysis, independent subspace analysis) and structured sparse coding are very popular. However, very few works have focused on the combination of these two tasks.

6 Structured dictionary learning: wanted properties We are interested in algorithms with the following four properties: (1) handle general, overlapping group structures; (2) online: fast, memory efficient, adaptive; (3) non-convex sparsity-inducing regularization: fewer measurements, weaker conditions on the dictionary, robust (w.r.t. noise, compressibility); (4) can deal with missing information. Current approaches: handle $\le 2$ of these.

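7 Cost function Notation: $\alpha$ hidden representation, $x$ observation, $D$ dictionary, $\mathcal{G}$ group structure (set system), $\mathcal{G} \subseteq 2^{\{1,\ldots,d_\alpha\}}$. Group structure inducing regularizer on the hidden representation:
$$\Omega(\alpha) = \left\| \left( \|\alpha_G\|_2 \right)_{G \in \mathcal{G}} \right\|_\eta, \tag{2}$$
$$\Omega(\alpha) = \left\| \left( \|d^G \circ \alpha\|_2 \right)_{G \in \mathcal{G}} \right\|_\eta, \tag{3}$$
$$\Omega(\alpha) = \left\| \left( \|A^G \alpha\|_2 \right)_{G \in \mathcal{G}} \right\|_\eta, \quad \eta \in (0, 2). \tag{4}$$
Approximation on the observed coordinates ($x_O$):
$$\frac{1}{2} \|x_O - D_O \alpha\|_2^2. \tag{5}$$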
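The regularizer (2) and the masked fit (5) can be evaluated directly. The sketch below is illustrative only: the function names, the 0-based index lists used to encode the groups, and the toy numbers are assumptions, and only the unweighted form (2) is covered.

```python
import numpy as np

def group_norm(alpha, groups, eta):
    """Omega(alpha) of Eq. (2): the eta-(quasi-)norm of the vector of group-wise l2 norms."""
    alpha = np.asarray(alpha, dtype=float)
    inner = np.array([np.linalg.norm(alpha[list(G)]) for G in groups])
    return float(np.sum(inner ** eta) ** (1.0 / eta))

def masked_fit(x, D, alpha, observed):
    """Eq. (5): squared error restricted to the observed coordinates O."""
    O = list(observed)
    residual = np.asarray(x, dtype=float)[O] - np.asarray(D, dtype=float)[O, :] @ np.asarray(alpha, dtype=float)
    return 0.5 * float(residual @ residual)

# Two disjoint groups; only the first one is active.
omega_value = group_norm([3.0, 4.0, 0.0, 0.0], groups=[[0, 1], [2, 3]], eta=1.0)
# Third coordinate of x is treated as missing.
fit_value = masked_fit([1.0, 2.0, 3.0], np.eye(3), [1.0, 0.0, 0.0], observed=[0, 1])
```

With singleton groups and $\eta = 1$ the same function returns the ordinary $\ell_1$ norm, which is the traditional sparse coding penalty.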

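8 Cost function continued Loss for a fixed observation ($\kappa > 0$):
$$l(x_O, D_O) = \min_{\alpha} \left[ \frac{1}{2} \|x_O - D_O \alpha\|_2^2 + \kappa \Omega(\alpha) \right]. \tag{6}$$
Goal (OSDL): minimize the average loss of the dictionary
$$\min_{D} f_t(D) := \frac{1}{t} \sum_{i=1}^{t} l(x_{O_i}, D_{O_i}). \tag{7}$$
Possible dictionary/representation constraints: $D \in \mathcal{D} = \times_{i=1}^{d_\alpha} \mathcal{D}_i \subseteq \mathbb{R}^{d_x \times d_\alpha}$: closed, convex, and bounded. $\alpha \in \mathcal{A} \subseteq \mathbb{R}^{d_\alpha}$: convex, closed.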

9 Special cases $O_i = \{1,\ldots,d_x\}$ ($\forall i$): fully observed OSDL task. Special cases for $\mathcal{G}$: Traditional sparse dictionary: $\mathcal{G} = \{\{1\}, \{2\}, \ldots, \{d_\alpha\}\}$. Hierarchical dictionary: $\mathcal{G}$ = descendants of the nodes. Grid-adapted dictionary: $\mathcal{G}$ = nearest neighbors of the nodes. Group Lasso: $\mathcal{G}$ = partition. Elastic net: $\mathcal{G}$ = singletons and $\{1,\ldots,d_\alpha\}$. Contiguous code: $\mathcal{G}$ = intervals.
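A few of these group structures can be generated in a handful of lines. This is a sketch under stated assumptions: indices are 0-based, the tree uses heap indexing with each group containing a node together with its descendants, and the grid neighborhood is a square window with wrap-around (a torus). None of these encoding choices are specified on the slide, and the function names are hypothetical.

```python
def singleton_groups(d):
    """Traditional sparse dictionary: G = {{0}, {1}, ..., {d-1}}."""
    return [[j] for j in range(d)]

def binary_tree_groups(levels):
    """Hierarchical dictionary on a complete binary tree (root = 0, children 2n+1 and 2n+2).
    One group per node: the node itself plus all of its descendants (assumed convention)."""
    d = 2 ** levels - 1
    groups = []
    for node in range(d):
        group, frontier = [], [node]
        while frontier:
            n = frontier.pop()
            group.append(n)
            frontier.extend(c for c in (2 * n + 1, 2 * n + 2) if c < d)
        groups.append(sorted(group))
    return groups

def torus_groups(rows, cols, radius=1):
    """Grid-adapted dictionary on a torus: one group per node,
    its (2*radius+1) x (2*radius+1) neighborhood with wrap-around."""
    groups = []
    for r in range(rows):
        for c in range(cols):
            group = {((r + dr) % rows) * cols + (c + dc) % cols
                     for dr in range(-radius, radius + 1)
                     for dc in range(-radius, radius + 1)}
            groups.append(sorted(group))
    return groups

tree = binary_tree_groups(3)   # 7 nodes, 7 overlapping groups
torus = torus_groups(4, 4)     # 16 nodes, each group has 9 members
```

The tree and torus groups overlap, which is exactly the general, overlapping case that the method is meant to handle.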

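10 Special cases continued Special cases for $\{A^G\}_{G \in \mathcal{G}}$:
Fused Lasso: $\Omega(\alpha) = \sum_{j=1}^{d_\alpha - 1} |\alpha_{j+1} - \alpha_j|$.
Graph-guided fusion penalty: $\Omega(\alpha) = \sum_{e=(i,j) \in E:\, i<j} w_{ij} |\alpha_i - v_{ij} \alpha_j|$.
Linear trend/polynomial filtering: $\Omega(\alpha) = \sum_{j=2}^{d_\alpha - 1} |-\alpha_{j-1} + 2\alpha_j - \alpha_{j+1}|$.
Generalized Lasso penalty: $\Omega(\alpha) = \|A\alpha\|_1$.
Total variation: $\Omega(\alpha) = \sum_{i=1}^{d_1} \sum_{j=1}^{d_2} \|(\nabla \alpha)_{ij}\|_2$.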
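The fused Lasso and linear trend filtering penalties are both instances of the generalized Lasso penalty $\|A\alpha\|_1$, with $A$ a first- or second-order difference matrix; taking each row of $A$ as one $A^G$ and $\eta = 1$ recovers them from the general regularizer. The sketch below checks this on a small vector. The helper names and the test vector are assumptions.

```python
import numpy as np

def first_difference(d):
    """(d-1) x d matrix A with (A alpha)_j = alpha_{j+1} - alpha_j (fused Lasso)."""
    return np.diff(np.eye(d), axis=0)

def second_difference(d):
    """(d-2) x d matrix A with (A alpha)_j = -alpha_{j-1} + 2 alpha_j - alpha_{j+1} (linear trend filtering)."""
    return -np.diff(np.eye(d), n=2, axis=0)

def generalized_lasso_penalty(A, alpha):
    """Omega(alpha) = ||A alpha||_1."""
    return float(np.sum(np.abs(A @ np.asarray(alpha, dtype=float))))

alpha = np.array([1.0, 1.0, 3.0, 3.0, 0.0])           # piecewise-constant code
fused = generalized_lasso_penalty(first_difference(5), alpha)
trend = generalized_lasso_penalty(second_difference(5), alpha)
```

A piecewise-constant code has a small fused Lasso penalty, while a code that is linear in its index has zero trend filtering penalty.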

11 Special cases continued Special cases for $\mathcal{D}$, $\mathcal{A}$: Traditional setting: $\ell_2$-constrained $D$. Structured NMF: non-negative $D$ and $\alpha$. Structured mixture-of-topics: $\ell_1$-constrained $D$; non-negative $D$, $\alpha$. Hard representation constraints: group norm/elastic net/fused Lasso constrained $\alpha$. Double structured dictionaries: group norm constraints on $\alpha$ and $D$.

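12 OSDL optimization Online optimization of $D$ through alternations: For fixed $D_{t-1}$ and $x_{O_t}$, $\alpha_t$ is the solution of
$$\alpha_t = \arg\min_{\alpha \in \mathcal{A}} \left[ \frac{1}{2} \|x_{O_t} - (D_{t-1})_{O_t} \alpha\|_2^2 + \kappa \Omega(\alpha) \right]. \tag{8}$$
Using $\{\alpha_i\}_{i=1}^{t}$, $D_t$ is updated by means of the quadratic optimization
$$\hat{f}_t(D_t) = \min_{D \in \mathcal{D}} f_t(D, \{\alpha_i\}_{i=1}^{t}). \tag{9}$$
Solution idea: variational property of $\|\cdot\|_\eta$; block coordinate descent (BCD) + 3 different $\hat{f}_t$ statistics + matrix recursions.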

13 Numerical examples: inpainting of natural images. Structured (toroid) vs. unstructured dictionary: 13-19% improvement. Efficiency in case of missing observations: MSE grows slowly; $p_{tr} = 0.9$ (training incompleteness: 90%) is still OK. Figure: left: unstructured; center: structured; right: structured, incomplete observations.

14 Numerical examples: inpainting, full unseen image. Learning: $p_{tr} = 0.5$. Inpainting: $p_{test}^{val} = 0.7$.

15 Numerical examples: inpainting, full unseen image. Learning: $p_{tr} = 0.5$. Inpainting: $p_{test}^{val} = 0.7$ (PSNR = 29 dB).

16 Numerical examples: online structured NMF on faces. Online $\mathcal{G}$-NMF: special case of OSDL. Illustration: color FERET, sized facial dataset. $\mathcal{G}$: complete, 8-level binary tree ($d_\alpha = 255$).

17 Numerical examples: collaborative filtering. Joke recommendation (Jester): 100 jokes × 73,421 users. Observation: $x_{O_t}$ = ratings of the $t$-th user. Baseline: best known RMSEs (item neighbor), (unstructured dictionary, $d_\alpha = 100$). Result: toroid $\mathcal{G}$ ($d_\alpha = 100$): RMSE = , hierarchical $\mathcal{G}$ ($d_\alpha = 15$): RMSE =

18 Conclusions We developed a dictionary learning method that enables general overlapping group structures, is online, applies non-convex sparsity-inducing regularization, and can deal with missing information. It provides dictionary learning for several actively studied structured sparse coding problems. Numerical examples: inpainting of natural images, structured NMF, collaborative filtering.

19 Acknowledgments The research was partly supported by the Department of Energy (grant number DESC ).

20 Thank you for the attention!

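21 Representation optimization ($\alpha$) Structured sparse coding task:
$$\frac{1}{2} \|x_{O_t} - (D_{t-1})_{O_t} \alpha\|_2^2 + \kappa \Omega(\alpha) \to \min_{\alpha \in \mathcal{A}}. \tag{10}$$
Solution: let us use the variational property of $\|\cdot\|_\eta$,
$$\|y\|_\eta = \min_{z \in \mathbb{R}_+^d} \frac{1}{2} \left[ \sum_{j=1}^{d} \frac{y_j^2}{z_j} + \|z\|_\beta \right], \tag{11}$$
where $y \in \mathbb{R}^d$, $\beta = \frac{\eta}{2 - \eta}$, and the minimum value is attained at $z_j^* = |y_j|^{2-\eta} \|y\|_\eta^{\eta - 1}$.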
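The variational property (11) is easy to check numerically. The sketch below evaluates the right-hand side at the stated minimizer and compares it with $\|y\|_\eta$; the function names and the test vector are assumptions, and $y$ is chosen without zero entries so that no division by zero occurs.

```python
def eta_norm(y, eta):
    """||y||_eta = (sum_j |y_j|^eta)^(1/eta); a quasi-norm when eta < 1."""
    return sum(abs(v) ** eta for v in y) ** (1.0 / eta)

def variational_objective(y, z, eta):
    """Right-hand side of Eq. (11) for a given z > 0, with beta = eta / (2 - eta)."""
    beta = eta / (2.0 - eta)
    return 0.5 * (sum(v * v / w for v, w in zip(y, z)) + eta_norm(z, beta))

def optimal_z(y, eta):
    """Closed-form minimizer z_j = |y_j|^(2 - eta) * ||y||_eta^(eta - 1)."""
    scale = eta_norm(y, eta) ** (eta - 1.0)
    return [abs(v) ** (2.0 - eta) * scale for v in y]

y = [3.0, -4.0, 1.0]
eta = 1.5                                   # any value in (0, 2) is allowed by the slide
z_star = optimal_z(y, eta)
value_at_optimum = variational_objective(y, z_star, eta)
value_perturbed = variational_objective(y, [1.1 * w for w in z_star], eta)
```

The point of the identity is that, for fixed $z$, the penalty becomes a weighted quadratic in $y$, which is what makes the alternating scheme on the next slide possible.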

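22 Representation optimization ($\alpha$) continued Our problem is equivalent to the solution of
$$J(\alpha, z) = \frac{1}{2} \|x_{O_t} - (D_{t-1})_{O_t} \alpha\|_2^2 + \kappa \frac{1}{2} \left( \alpha^T H \alpha + \|z\|_\beta \right) \to \min_{\alpha \in \mathcal{A},\, z \in \mathbb{R}_+^{|\mathcal{G}|}}, \quad \text{where } H = H(z) = \sum_{G \in \mathcal{G}} (A^G)^T A^G / z^G. \tag{12}$$
One can optimize $J(\alpha, z)$ by iterative alternating steps: For given $\alpha$: explicit formula for the optimal $z = (z^G)_{G \in \mathcal{G}}$,
$$z^G = \|A^G \alpha\|_2^{2-\eta} \left\| \left( \|A^G \alpha\|_2 \right)_{G \in \mathcal{G}} \right\|_\eta^{\eta - 1}. \tag{13}$$
For given $z$: quadratic cost in $\alpha$ on the convex set $\mathcal{A}$.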
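The alternation between (13) and the quadratic step can be sketched as follows. Several simplifications are assumed here and are not taken from the slides: $\mathcal{A} = \mathbb{R}^{d_\alpha}$ (no constraint, so the quadratic step is a linear solve), each $A^G$ simply selects the coordinates in $G$ (so $H$ is diagonal), and a small `eps` is added to the group norms to avoid division by zero. For missing observations the caller would pass only the observed rows of `x` and `D`.

```python
import numpy as np

def omega(alpha, groups, eta):
    """Omega(alpha) = || (||alpha_G||_2)_G ||_eta."""
    norms = np.array([np.linalg.norm(alpha[G]) for G in groups])
    return float(np.sum(norms ** eta) ** (1.0 / eta))

def objective(x, D, alpha, groups, kappa, eta):
    """0.5 * ||x - D alpha||_2^2 + kappa * Omega(alpha), as in Eq. (10)."""
    return 0.5 * float(np.sum((x - D @ alpha) ** 2)) + kappa * omega(alpha, groups, eta)

def structured_sparse_code(x, D, groups, kappa, eta, n_iter=50, eps=1e-8):
    """Alternate the closed-form z-step (Eq. 13) with the quadratic alpha-step."""
    alpha = np.linalg.lstsq(D, x, rcond=None)[0]          # least-squares initialization (assumption)
    for _ in range(n_iter):
        norms = np.array([np.linalg.norm(alpha[G]) for G in groups]) + eps
        total = np.sum(norms ** eta) ** (1.0 / eta)
        z = norms ** (2.0 - eta) * total ** (eta - 1.0)   # Eq. (13), smoothed by eps
        h = np.zeros(D.shape[1])                          # diagonal of H(z)
        for G, z_G in zip(groups, z):
            h[G] += 1.0 / z_G
        # Stationarity of J in alpha: (D^T D + kappa * H) alpha = D^T x
        alpha = np.linalg.solve(D.T @ D + kappa * np.diag(h), D.T @ x)
    return alpha

# Toy data (hypothetical): only the first of three groups is active.
rng = np.random.default_rng(0)
D = rng.standard_normal((10, 6))
groups = [[0, 1], [2, 3], [4, 5]]
alpha_true = np.array([1.0, -2.0, 0.0, 0.0, 0.0, 0.0])
x = D @ alpha_true + 0.01 * rng.standard_normal(10)
alpha_init = np.linalg.lstsq(D, x, rcond=None)[0]
alpha_hat = structured_sparse_code(x, D, groups, kappa=0.5, eta=1.0)
```

Each pass minimizes $J$ exactly in one block of variables, so the objective (10) does not increase, up to the small smoothing introduced by `eps`.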

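23 Dictionary optimization ($D$) Cost function ($\rho$: non-negative forgetting factor):
$$\hat{f}_t(D) = \frac{1}{\sum_{j=1}^{t} (j/t)^\rho} \sum_{i=1}^{t} \left( \frac{i}{t} \right)^\rho \left[ \frac{1}{2} \|x_{O_i} - D_{O_i} \alpha_i\|_2^2 + \kappa \Omega(\alpha_i) \right] \to \min_{D \in \mathcal{D}}.$$
Optimization (BCD): optimize in $d_j$, while the other columns ($d_i$, $i \ne j$) are fixed. $\hat{f}_t$ is quadratic in $d_j$:
1. Solve the equation
$$\frac{\partial \hat{f}_t}{\partial d_j}(u_j) = 0. \tag{14}$$
2. Project the solution to the constraint set $\mathcal{D}_j$:
$$d_j = \Pi_{\mathcal{D}_j}(u_j). \tag{15}$$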

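24 Computation of $u_j$ Task:
$$\frac{\partial \hat{f}_t}{\partial d_j}(u_j) = 0. \tag{16}$$
Solution: $u_j$ satisfies the linear equation
$$C_{j,t} u_j = b_{j,t} - e_{j,t} + C_{j,t} d_j, \tag{17}$$
where the statistics $\{\{C_{j,t}\}_{j=1}^{d_\alpha}, B_t, \{e_{j,t}\}_{j=1}^{d_\alpha}\}$ are defined as follows.

25 Computation of $u_j$ continued
$$C_{j,t} = \sum_{i=1}^{t} \left( \frac{i}{t} \right)^\rho \Delta_i \alpha_{i,j}^2 \in \mathbb{R}^{d_x \times d_x} \quad (j = 1,\ldots,d_\alpha), \tag{18}$$
$$B_t = \sum_{i=1}^{t} \left( \frac{i}{t} \right)^\rho \Delta_i x_i \alpha_i^T = [b_{1,t},\ldots,b_{d_\alpha,t}] \in \mathbb{R}^{d_x \times d_\alpha}, \tag{19}$$
$$e_{j,t} = \sum_{i=1}^{t} \left( \frac{i}{t} \right)^\rho \Delta_i D \alpha_i \alpha_{i,j} \in \mathbb{R}^{d_x} \quad (j = 1,\ldots,d_\alpha), \tag{20}$$
where $C_{j,t}$ and the $\Delta_i$ are diagonal; the matrix $\Delta_i$ represents $O_i$ (element $j$ of the diagonal is 1 if $j \in O_i$, and 0 otherwise).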
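Because $C_{j,t}$ is diagonal, Eq. (17) can be solved coordinate by coordinate. The sketch below computes the three statistics in batch form from Eqs. (18)-(20), solves for $u_j$, and then projects onto the unit $\ell_2$ ball. The batch layout (observations as columns, a 0/1 `Mask` array holding the diagonals of the $\Delta_i$), the function names, and the choice of the unit ball as $\mathcal{D}_j$ are assumptions made for illustration.

```python
import numpy as np

def solve_column(j, D, X, Alpha, Mask, rho=0.0):
    """Solve C_{j,t} u_j = b_{j,t} - e_{j,t} + C_{j,t} d_j (Eq. 17) for column j.
    X: d_x x t observations, Alpha: d_alpha x t codes, Mask: d_x x t with 1 = observed."""
    t = X.shape[1]
    w = (np.arange(1, t + 1) / t) ** rho               # forgetting weights (i/t)^rho
    wa = w * Alpha[j]                                  # (i/t)^rho * alpha_{i,j}
    c = (Mask * (wa * Alpha[j])).sum(axis=1)           # diagonal of C_{j,t}, Eq. (18)
    b = (Mask * X * wa).sum(axis=1)                    # b_{j,t}, column j of B_t, Eq. (19)
    e = (Mask * (D @ Alpha) * wa).sum(axis=1)          # e_{j,t}, Eq. (20)
    u = D[:, j].copy()
    ok = c > 0                                         # coordinates with no information keep d_j
    u[ok] = (b[ok] - e[ok]) / c[ok] + D[ok, j]
    return u

def project_l2_ball(u):
    """Projection onto {d : ||d||_2 <= 1}, one possible constraint set D_j."""
    return u / max(1.0, np.linalg.norm(u))

# Toy data (hypothetical) with roughly 30% missing entries.
rng = np.random.default_rng(0)
d_x, d_alpha, t = 5, 4, 30
D = rng.standard_normal((d_x, d_alpha))
Alpha = rng.standard_normal((d_alpha, t))
X = rng.standard_normal((d_x, t))
Mask = (rng.random((d_x, t)) < 0.7).astype(float)
u = solve_column(1, D, X, Alpha, Mask, rho=0.5)
d_new = project_l2_ball(u)
```

Substituting `u` for column 1 of `D` makes the weighted, masked gradient with respect to that column vanish, which is a convenient way to sanity-check the statistics.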

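26 Matrix recursion lemma Let $N_t \in \mathbb{R}^{L_1 \times L_2}$ ($t = 1, 2, \ldots$) be a given matrix series, $\gamma_t = \left( 1 - \frac{1}{t} \right)^\rho$, $\rho \ge 0$, and let the $M_t$ and $M'_t$ matrix series be defined as
$$M_t = \gamma_t M_{t-1} + N_t \in \mathbb{R}^{L_1 \times L_2} \quad (t = 1, 2, \ldots), \tag{21}$$
$$M'_t = \sum_{i=1}^{t} \left( \frac{i}{t} \right)^\rho N_i \in \mathbb{R}^{L_1 \times L_2} \quad (t = 1, 2, \ldots). \tag{22}$$
If $\rho = 0$, then $M_t = M_0 + M'_t$ ($\forall t \ge 1$). When $\rho > 0$, then $M_t = M'_t$ ($\forall t \ge 1$).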
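The lemma can be verified numerically on random matrices. The sketch below compares the recursion (21) with the direct sum (22) for both cases; the function names and the random test data are assumptions.

```python
import numpy as np

def recursive_series(N, rho, M0):
    """M_t = gamma_t * M_{t-1} + N_t with gamma_t = (1 - 1/t)^rho, Eq. (21)."""
    M = M0.copy()
    for t, N_t in enumerate(N, start=1):
        gamma = (1.0 - 1.0 / t) ** rho     # gamma_1 is 0 for rho > 0 and 1 for rho = 0
        M = gamma * M + N_t
    return M

def direct_series(N, rho):
    """M'_t = sum_{i=1}^t (i/t)^rho * N_i, Eq. (22), evaluated at t = len(N)."""
    t = len(N)
    return sum((i / t) ** rho * N_i for i, N_i in enumerate(N, start=1))

rng = np.random.default_rng(0)
N = [rng.standard_normal((2, 3)) for _ in range(10)]
M0 = rng.standard_normal((2, 3))

forgetting_recursive = recursive_series(N, 0.5, M0)   # rho > 0: M0 is forgotten at t = 1
forgetting_direct = direct_series(N, 0.5)
plain_recursive = recursive_series(N, 0.0, M0)        # rho = 0: M0 is carried along
plain_direct = M0 + direct_series(N, 0.0)
```

For $\rho > 0$ the first factor $\gamma_1 = 0$ wipes out $M_0$, which is why the initialization is arbitrary in that case.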

27 Computation of $u_j$ continued By the matrix recursion lemma, one can update $C_{j,t}$ and $B_t$ as
$$C_{j,t} = \gamma_t C_{j,t-1} + \Delta_t \alpha_{t,j}^2, \tag{23}$$
$$B_t = \gamma_t B_{t-1} + \Delta_t x_t \alpha_t^T, \tag{24}$$
with $C_{j,0} = 0$, $B_0 = 0$ ($\rho = 0$), or arbitrary initialization ($\rho > 0$). According to numerical experience, an efficient online approximation for $e_{j,t}$ is
$$e_{j,t} = \gamma_t e_{j,t-1} + \Delta_t D \alpha_t \alpha_{t,j}, \tag{25}$$
with the current estimate of $D$ and initialization $e_{j,0} = 0$.

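28 Special, fully observable case In this case ($\Delta_i = I$, $\forall i$):
$$C_{j,t} = I \sum_{i=1}^{t} \left( \frac{i}{t} \right)^\rho \alpha_{i,j}^2, \quad B_t = \sum_{i=1}^{t} \left( \frac{i}{t} \right)^\rho x_i \alpha_i^T, \tag{26}$$
$$e_{j,t} = \sum_{i=1}^{t} \left( \frac{i}{t} \right)^\rho D \alpha_i \alpha_{i,j} = D \sum_{i=1}^{t} \left( \frac{i}{t} \right)^\rho \alpha_i \alpha_{i,j}, \tag{27}$$
that is, $D$ can be pulled out from the $e_{j,t}$s, and it is sufficient to maintain 2 statistics, $B_t$ and
$$A_t = \sum_{i=1}^{t} \left( \frac{i}{t} \right)^\rho \alpha_i \alpha_i^T \in \mathbb{R}^{d_\alpha \times d_\alpha}. \tag{28}$$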
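In the fully observed case one online step therefore reduces to two rank-one updates followed by a pass over the columns. The sketch below is an illustration under assumptions: the codes $\alpha_t$ are taken as given (random here, rather than computed by the structured sparse coding step), $\mathcal{D}_j$ is the unit $\ell_2$ ball, and the function name is hypothetical.

```python
import numpy as np

def osdl_step_fully_observed(D, A, B, x, alpha, t, rho=0.0):
    """Update A_t (Eq. 28) and B_t (Eq. 26) recursively, then do one BCD pass over the columns."""
    gamma = (1.0 - 1.0 / t) ** rho
    A = gamma * A + np.outer(alpha, alpha)
    B = gamma * B + np.outer(x, alpha)
    D = D.copy()
    for j in range(D.shape[1]):
        if A[j, j] > 0:                                        # C_{j,t} = A[j, j] * I
            # Eq. (17) with e_{j,t} = D A[:, j] and b_{j,t} = B[:, j]
            u = (B[:, j] - D @ A[:, j]) / A[j, j] + D[:, j]
            D[:, j] = u / max(1.0, np.linalg.norm(u))          # projection onto the unit l2 ball
    return D, A, B

# Toy stream (hypothetical data).
rng = np.random.default_rng(0)
d_x, d_alpha, T, rho = 6, 4, 25, 0.5
D = rng.standard_normal((d_x, d_alpha))
D /= np.linalg.norm(D, axis=0)
A = np.zeros((d_alpha, d_alpha))
B = np.zeros((d_x, d_alpha))
xs = [rng.standard_normal(d_x) for _ in range(T)]
alphas = [rng.standard_normal(d_alpha) for _ in range(T)]
for t, (x_t, alpha_t) in enumerate(zip(xs, alphas), start=1):
    D, A, B = osdl_step_fully_observed(D, A, B, x_t, alpha_t, t, rho)
```

The memory footprint is independent of the number of processed samples: only $A_t$, $B_t$, and the current $D$ are stored.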

Collaborative Filtering via Group-Structured Dictionary Learning Zoltán Szabó 1, Barnabás Póczos 2, and András Lőrincz 1 1 Faculty of Informatics, Eötvös Loránd University, Pázmány Péter sétány 1/C, H-1117