Appendix to "Online ℓ1-Dictionary Learning with Application to Novel Document Detection"

Shiva Prasad Kasiviswanathan, Huahua Wang, Arindam Banerjee, Prem Melville

A Background about ADMM

In this section, we give a brief review of the general framework of ADMM. ADMM has recently gathered significant attention in the machine learning community due to its wide applicability to a range of learning problems with complex objective functions [1, 2]. Let p(x) : R^a → R and q(y) : R^b → R be convex functions, F ∈ R^{c×a}, G ∈ R^{c×b}, and z ∈ R^c. Consider the following optimization problem:

  min_{x,y} p(x) + q(y)  s.t.  Fx + Gy = z,    (1)

where the variable vectors x and y are separate in the objective, and coupled only in the constraint. The augmented Lagrangian for the above problem is given by

  L(x, y, ρ) = p(x) + q(y) + ρᵀ(z − Fx − Gy) + (φ/2)‖z − Fx − Gy‖₂²,

where ρ ∈ R^c is the Lagrangian multiplier and φ > 0 is a penalty parameter. ADMM utilizes the separable form of (1) and replaces the joint minimization over x and y with two simpler problems. The ADMM first minimizes L over x, then over y, and then applies a proximal minimization step with respect to the Lagrange multiplier ρ. The entire ADMM procedure is summarized in Algorithm 1. Here γ > 0 is a constant, and the subscript i denotes the ith iteration of the ADMM procedure. The ADMM procedure has been proved to converge to the global optimal solution under quite broad conditions [1].

Algorithm 1 : ADMM Update Equations for Solving (1)
  Iterate until convergence
    x_{i+1} ← argmin_x L(x, y_i, ρ_i)
    y_{i+1} ← argmin_y L(x_{i+1}, y, ρ_i)
    ρ_{i+1} ← ρ_i + γφ(z − Fx_{i+1} − Gy_{i+1})

A.1 ADMM Equations for updating the X_t's and A_t's

Consider the ℓ1-dictionary learning problem

  min_{A ∈ 𝒜, X ≥ 0} ‖P − AX‖₁ + λ‖X‖₁,

where 𝒜 is defined in Section 3.1. We use the following algorithm from [4] to solve this problem. It is quite easy to adapt the ADMM updates outlined in Algorithm 1 to update the X_t's and A_t's when the other variable is fixed (see, e.g., [4]).

ADMM for updating X, given fixed A. Here we are given matrices P ∈ R^{m×n} and A ∈ R^{m×k}, and we want to solve the following optimization problem:

  min_{X ≥ 0} ‖P − AX‖₁ + λ‖X‖₁.

Algorithm 2 shows the ADMM update steps for solving this problem. The entire derivation is presented in [4]; we reproduce the steps here for completeness.
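Algorithms 2 and 3 below are stated in terms of the entrywise soft-thresholding operator soft(M, c) = sign(M) · max(|M| − c, 0). As a reference point, here is a minimal NumPy sketch of this standard operator (the function name mirrors the pseudocode; the implementation is ours, not from [4]):

```python
import numpy as np

def soft(M, c):
    """Entrywise soft-thresholding: sign(M) * max(|M| - c, 0)."""
    return np.sign(M) * np.maximum(np.abs(M) - c, 0.0)
```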
In our experiments, we set φ = 5, κ = 1/Ψ_max(A), and γ = 1.89. These parameters are chosen based on the ADMM convergence results presented in [4, 6].

Algorithm 2 : ADMM for Updating X
  ADMM procedure for solving min_{X ≥ 0} ‖P − AX‖₁ + λ‖X‖₁
  Input: A ∈ R^{m×k}, P ∈ R^{m×n}, λ ≥ 0, γ > 0, φ > 0, κ > 0
  X₁ ← 0_{k×n}, E₁ ← P, ρ₁ ← 0_{m×n}
  for i = 1, 2, . . . to convergence do
    E_{i+1} ← soft(P − AX_i + ρ_i/φ, 1/φ)
    G ← Aᵀ(AX_i + E_{i+1} − P − ρ_i/φ)
    X_{i+1} ← max{X_i − κG − λκ/φ, 0}
    ρ_{i+1} ← ρ_i + γφ(P − AX_{i+1} − E_{i+1})
  Return X at convergence

ADMM for updating A, given fixed X. Given inputs P ∈ R^{m×n} and X ∈ R^{k×n}, consider the following optimization problem:

  min_{A ∈ 𝒜} ‖P − AX‖₁.

When repeating this optimization over multiple timesteps, we use warm starts for faster convergence, i.e., instead of initializing A₁ to 0_{m×k}, we initialize A₁ to the dictionary obtained at the end of the previous timestep.

Algorithm 3 : ADMM for Updating A
  ADMM procedure for solving min_{A ∈ 𝒜} ‖P − AX‖₁
  Input: X ∈ R^{k×n}, P ∈ R^{m×n}, γ > 0, φ > 0, κ > 0
  A₁ ← 0_{m×k}, E₁ ← P, ρ₁ ← 0_{m×n}
  for i = 1, 2, . . . to convergence do
    E_{i+1} ← soft(P − A_iX + ρ_i/φ, 1/φ)
    G ← (A_iX + E_{i+1} − P − ρ_i/φ)Xᵀ
    A_{i+1} ← Π_𝒜(max{A_i − κG, 0})
    ρ_{i+1} ← ρ_i + γφ(P − A_{i+1}X − E_{i+1})
  Return A at convergence
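To make Algorithm 2 concrete, the following NumPy sketch transcribes its update loop under a few stated assumptions: it reuses the soft operator sketched above, runs for a fixed number of iterations instead of a convergence test, sets κ = 1/Ψ_max(A) as in the text, and leaves the choice of φ and γ to the caller:

```python
import numpy as np

def admm_update_X(P, A, lam, phi, gamma, n_iters=100):
    """Sketch of Algorithm 2: approximately solve
    min_{X >= 0} ||P - AX||_1 + lam * ||X||_1.
    Assumes `soft` from the sketch above; `n_iters` stands in for the
    convergence test in the pseudocode."""
    m, n = P.shape
    k = A.shape[1]
    kappa = 1.0 / np.linalg.eigvalsh(A.T @ A).max()  # kappa = 1/Psi_max(A)
    X = np.zeros((k, n))
    rho = np.zeros((m, n))
    for _ in range(n_iters):
        E = soft(P - A @ X + rho / phi, 1.0 / phi)              # E-update
        G = A.T @ (A @ X + E - P - rho / phi)                   # gradient w.r.t. X
        X = np.maximum(X - kappa * G - lam * kappa / phi, 0.0)  # proximal step
        rho = rho + gamma * phi * (P - A @ X - E)               # multiplier update
    return X
```

The A-update of Algorithm 3 is the mirror image: the roles of A and X swap, the gradient becomes (A_iX + E_{i+1} − P − ρ_i/φ)Xᵀ, and the proximal step is followed by the projection Π_𝒜.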
B Analysis of OIADMM: Proofs from Section 4

First, let us recap the OIADMM update rules:

  Γ_{t+1} = argmin_Γ ‖Γ‖₁ + ⟨Δ_t, Γ̄_t − Γ⟩ + (β_t/2)‖Γ̄_t − Γ‖²_F,    (2)
  Â_{t+1} = argmin_{A ∈ 𝒜} β_t⟨G_{t+1}, A − Â_t⟩ + (β_t/2τ_t)‖A − Â_t‖²_F,    (3)
  Δ_{t+1} = Δ_t + β_t(P_t − Â_{t+1}X̂_t − Γ_{t+1}).    (4)

Let A_op be the optimum solution to the batch problem min_{A ∈ 𝒜} Σ_{t=1}^T ‖P_t − AX_t‖₁. Let Γ̄_t = P_t − Â_tX̂_t and Γ̃_t = P_t − Â_{t+1}X̂_t. For any t and any A ∈ 𝒜, let Γ̂_t = P_t − AX̂_t. The lemmas below hold for any A ∈ 𝒜, so in particular they hold for A set as A_op.

Proof Flow. Although the algorithm is relatively simple, the analysis is somewhat involved. Define Γ_t^op = P_t − A_opX_t. Then the regret of the OIADMM is

  R(T) = Σ_{t=1}^T (‖Γ̄_t‖₁ − ‖Γ_t^op‖₁).

We split the proof into three technical lemmas. We first upper bound ⟨Δ_t, Γ̃_t − Γ̂_t⟩ (Lemma B.2), and use it to bound ‖Γ_{t+1}‖₁ − ‖Γ̂_t‖₁ (Lemma B.3). In the proof of Lemma B.4, we bound ‖Γ̄_t‖₁ − ‖Γ_{t+1}‖₁, and this, when added to the bound on ‖Γ_{t+1}‖₁ − ‖Γ̂_t‖₁ from Lemma B.3, gives a bound on ‖Γ̄_t‖₁ − ‖Γ̂_t‖₁. The proof of the regret bound uses a canceling telescoping sum on the bound on ‖Γ̄_t‖₁ − ‖Γ̂_t‖₁.

We use the following simple identity in our proofs.

Lemma B.1. For matrices M₁, M₂, M₃, M₄ ∈ R^{m×n}, we have

  2⟨M₁ − M₂, M₃ − M₄⟩ = (‖M₁ − M₄‖²_F + ‖M₂ − M₃‖²_F) − (‖M₁ − M₃‖²_F + ‖M₂ − M₄‖²_F).
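Lemma B.1 is stated without proof; for completeness, it follows by expanding each squared Frobenius norm (a two-line check that is not part of the original text):

```latex
\|M_1 - M_4\|_F^2 - \|M_1 - M_3\|_F^2
   = 2\langle M_1, M_3 - M_4\rangle + \|M_4\|_F^2 - \|M_3\|_F^2,
\qquad
\|M_2 - M_3\|_F^2 - \|M_2 - M_4\|_F^2
   = -2\langle M_2, M_3 - M_4\rangle + \|M_3\|_F^2 - \|M_4\|_F^2.
```

Adding the two identities, the ‖M₃‖²_F and ‖M₄‖²_F terms cancel and what is left is 2⟨M₁ − M₂, M₃ − M₄⟩, which is exactly the claim.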
Lemma B.2. Let {Γ_t, Â_t, Δ_t} be the sequences generated by the OIADMM procedure. For any A ∈ 𝒜, we have

  ⟨Δ_t, Γ̃_t − Γ̂_t⟩ ≤ (β_t/2τ_t)(‖A − Â_t‖²_F − ‖A − Â_{t+1}‖²_F) + (β_t/2)(‖Γ̂_t − Γ_{t+1}‖²_F − ‖Γ_{t+1} − Γ̃_t‖²_F − ‖Γ̄_t − Γ̂_t‖²_F) − (β_t/2)(1/τ_t − Ψ_max(X̂_t))‖Â_{t+1} − Â_t‖²_F.

Proof. For any A ∈ 𝒜, (3) is equivalent to the following variational inequality [5]:

  ⟨β_tG_{t+1} + (β_t/τ_t)(Â_{t+1} − Â_t), A − Â_{t+1}⟩ ≥ 0.    (5)

Using Γ̃_t = P_t − Â_{t+1}X̂_t and substituting G_{t+1} = −(Δ_t/β_t + Γ̄_t − Γ_{t+1})X̂_tᵀ, we have

  β_t⟨G_{t+1}, A − Â_{t+1}⟩ = −β_t⟨(Δ_t/β_t + Γ̄_t − Γ_{t+1})X̂_tᵀ, A − Â_{t+1}⟩
    = β_t⟨Δ_t/β_t + Γ̄_t − Γ_{t+1}, Â_{t+1}X̂_t − AX̂_t⟩
    = β_t⟨Δ_t/β_t + Γ̄_t − Γ_{t+1}, (P_t − AX̂_t) − (P_t − Â_{t+1}X̂_t)⟩
    = ⟨Δ_t, Γ̂_t − Γ̃_t⟩ + β_t⟨Γ̄_t − Γ_{t+1}, Γ̂_t − Γ̃_t⟩.    (6)

Substituting (6) into (5) and rearranging the terms yield

  ⟨Δ_t, Γ̃_t − Γ̂_t⟩ ≤ β_t⟨Γ̄_t − Γ_{t+1}, Γ̂_t − Γ̃_t⟩ + (β_t/τ_t)⟨Â_{t+1} − Â_t, A − Â_{t+1}⟩.    (7)

By Lemma B.1, the first term on the right-hand side can be rewritten as

  2⟨Γ̄_t − Γ_{t+1}, Γ̂_t − Γ̃_t⟩ = ‖Γ̄_t − Γ̃_t‖²_F + ‖Γ_{t+1} − Γ̂_t‖²_F − ‖Γ̄_t − Γ̂_t‖²_F − ‖Γ_{t+1} − Γ̃_t‖²_F.    (8)

Substituting the definitions of Γ̄_t and Γ̃_t, we have

  ‖Γ̄_t − Γ̃_t‖²_F = ‖(P_t − Â_tX̂_t) − (P_t − Â_{t+1}X̂_t)‖²_F = ‖(Â_{t+1} − Â_t)X̂_t‖²_F ≤ Ψ_max(X̂_t)‖Â_{t+1} − Â_t‖²_F.    (9)

Remember that Ψ_max(X̂_t) is the maximum eigenvalue of X̂_tX̂_tᵀ. Using Lemma B.1 again, the second term on the right-hand side of (7) satisfies

  2⟨Â_{t+1} − Â_t, A − Â_{t+1}⟩ = ‖A − Â_t‖²_F − ‖A − Â_{t+1}‖²_F − ‖Â_{t+1} − Â_t‖²_F.    (10)

Combining the results in (7), (8), (9), and (10), we get the desired bound.

Lemma B.3. Let {Γ_t, Â_t, Δ_t} be the sequences generated by the OIADMM procedure. For any A ∈ 𝒜, we have

  ‖Γ_{t+1}‖₁ − ‖Γ̂_t‖₁ ≤ (1/2β_t)(‖Δ_t‖²_F − ‖Δ_{t+1}‖²_F) + (β_t/2τ_t)(‖A − Â_t‖²_F − ‖A − Â_{t+1}‖²_F) − (β_t/2)(1/τ_t − Ψ_max(X̂_t))‖Â_{t+1} − Â_t‖²_F − (β_t/2)‖Γ̄_t − Γ_{t+1}‖²_F.

Proof. Let ∂‖Γ_{t+1}‖₁ denote the subgradient of ‖·‖₁ at Γ_{t+1}. Now Γ_{t+1} is a minimizer of (2). Therefore,

  0_{m×n} ∈ ∂‖Γ_{t+1}‖₁ − Δ_t − β_t(Γ̄_t − Γ_{t+1}).

Rearranging the terms gives Δ_t + β_t(Γ̄_t − Γ_{t+1}) ∈ ∂‖Γ_{t+1}‖₁. Since ‖·‖₁ is a convex function, we have

  ‖Γ̂_t‖₁ ≥ ‖Γ_{t+1}‖₁ + ⟨Δ_t + β_t(Γ̄_t − Γ_{t+1}), Γ̂_t − Γ_{t+1}⟩,

i.e.,

  ‖Γ_{t+1}‖₁ − ‖Γ̂_t‖₁ ≤ ⟨Δ_t, Γ_{t+1} − Γ̂_t⟩ + β_t⟨Γ̄_t − Γ_{t+1}, Γ_{t+1} − Γ̂_t⟩.    (11)

Using Lemma B.1, the last term can be rewritten as

  β_t⟨Γ̄_t − Γ_{t+1}, Γ_{t+1} − Γ̂_t⟩ = (β_t/2)(‖Γ̄_t − Γ̂_t‖²_F − ‖Γ̄_t − Γ_{t+1}‖²_F − ‖Γ_{t+1} − Γ̂_t‖²_F).    (12)

Splitting ⟨Δ_t, Γ_{t+1} − Γ̂_t⟩ = ⟨Δ_t, Γ_{t+1} − Γ̃_t⟩ + ⟨Δ_t, Γ̃_t − Γ̂_t⟩ and combining the inequality of Lemma B.2 with (12) gives

  ⟨Δ_t, Γ̃_t − Γ̂_t⟩ + β_t⟨Γ̄_t − Γ_{t+1}, Γ_{t+1} − Γ̂_t⟩ ≤ (β_t/2τ_t)(‖A − Â_t‖²_F − ‖A − Â_{t+1}‖²_F) − (β_t/2)(1/τ_t − Ψ_max(X̂_t))‖Â_{t+1} − Â_t‖²_F − (β_t/2)‖Γ̄_t − Γ_{t+1}‖²_F − (β_t/2)‖Γ_{t+1} − Γ̃_t‖²_F.    (13)

Since Γ_{t+1} − Γ̃_t = (Δ_t − Δ_{t+1})/β_t by (4), we have

  ⟨Δ_t, Γ_{t+1} − Γ̃_t⟩ − (β_t/2)‖Γ_{t+1} − Γ̃_t‖²_F = (1/β_t)⟨Δ_t, Δ_t − Δ_{t+1}⟩ − (1/2β_t)‖Δ_t − Δ_{t+1}‖²_F = (1/2β_t)(‖Δ_t‖²_F − ‖Δ_{t+1}‖²_F).    (14)

Plugging (13) and (14) into (11) yields the result.

Lemma B.4. Let {Γ_t, Â_t, Δ_t} be the sequences generated by the OIADMM procedure. If τ_t satisfies 1/τ_t ≥ Ψ_max(X̂_t), then

  ‖Γ̄_t‖₁ − ‖Γ̂_t‖₁ ≤ (1/2β_t)(‖Λ_{t+1}‖²_F + ‖Δ_t‖²_F − ‖Δ_{t+1}‖²_F) + (β_t/2τ_t)(‖A − Â_t‖²_F − ‖A − Â_{t+1}‖²_F),

where Λ_{t+1} ∈ ∂‖Γ_{t+1}‖₁.

Proof. Let Λ_{t+1} ∈ ∂‖Γ_{t+1}‖₁. By convexity, ‖Γ̄_t‖₁ − ‖Γ_{t+1}‖₁ ≤ ⟨Λ_{t+1}, Γ̄_t − Γ_{t+1}⟩. Now,

  ⟨Λ_{t+1}, Γ̄_t − Γ_{t+1}⟩ = ⟨Λ_{t+1}/√β_t, √β_t (Γ̄_t − Γ_{t+1})⟩ ≤ (1/2β_t)‖Λ_{t+1}‖²_F + (β_t/2)‖Γ̄_t − Γ_{t+1}‖²_F.

Therefore,

  ‖Γ̄_t‖₁ − ‖Γ_{t+1}‖₁ ≤ (1/2β_t)‖Λ_{t+1}‖²_F + (β_t/2)‖Γ̄_t − Γ_{t+1}‖²_F.    (15)

Adding (15) and the inequality of Lemma B.3 together, we get

  ‖Γ̄_t‖₁ − ‖Γ̂_t‖₁ ≤ (1/2β_t)(‖Λ_{t+1}‖²_F + ‖Δ_t‖²_F − ‖Δ_{t+1}‖²_F) + (β_t/2τ_t)(‖A − Â_t‖²_F − ‖A − Â_{t+1}‖²_F) − (β_t/2)(1/τ_t − Ψ_max(X̂_t))‖Â_{t+1} − Â_t‖²_F.

Setting 1/τ_t ≥ Ψ_max(X̂_t) means that −(β_t/2)(1/τ_t − Ψ_max(X̂_t))‖Â_{t+1} − Â_t‖²_F ≤ 0. Therefore,

  ‖Γ̄_t‖₁ − ‖Γ̂_t‖₁ ≤ (1/2β_t)(‖Λ_{t+1}‖²_F + ‖Δ_t‖²_F − ‖Δ_{t+1}‖²_F) + (β_t/2τ_t)(‖A − Â_t‖²_F − ‖A − Â_{t+1}‖²_F).

Theorem B.5 (Theorem 4.1 Restated). Let {Γ_t, Â_t, Δ_t} be the sequences generated by the OIADMM procedure and let R(T) be defined as above. Assume the following conditions hold: (i) the Frobenius norm of ∂‖Γ_t‖₁ is upper bounded by Φ, (ii) Â₁ = 0_{m×k} and ‖A_op‖_F ≤ D, (iii) Δ₁ = 0_{m×n}, and (iv) 1/τ ≥ Ψ_max(X̂_t) for all t. Setting β_t = β = (Φ/D)√(τT), we have

  R(T) ≤ ΦD√(T/τ) + Σ_{t=1}^T ‖A_opE_t‖₁,

where E_t = X_t − X̂_t.
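Before the proof, a remark on where the setting of β comes from (this side calculation is ours, not part of the original text): up to the Σ_{t=1}^T ‖A_opE_t‖₁ term, the proof below bounds the regret by Φ²T/(2β) + βD²/(2τ), and minimizing this expression over β > 0 gives

```latex
\beta^{*} = \arg\min_{\beta > 0}\;\frac{\Phi^2 T}{2\beta} + \frac{\beta D^2}{2\tau}
          = \frac{\Phi}{D}\sqrt{\tau T},
\qquad
\frac{\Phi^2 T}{2\beta^{*}} + \frac{\beta^{*} D^2}{2\tau} = \Phi D\sqrt{T/\tau}.
```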
Proof. Substituting Γ̂_t^op = P_t − A_opX̂_t for Γ̂_t and A_op for A in Lemma B.4 and summing the inequality over t from 1 to T, we get the following canceling telescoping sum:

  Σ_{t=1}^T (‖Γ̄_t‖₁ − ‖Γ̂_t^op‖₁) ≤ Σ_{t=1}^T (1/2β)‖Λ_{t+1}‖²_F + (1/2β)(‖Δ₁‖²_F − ‖Δ_{T+1}‖²_F) + (β/2τ)(‖A_op − Â₁‖²_F − ‖A_op − Â_{T+1}‖²_F)
    = Σ_{t=1}^T (1/2β)‖Λ_{t+1}‖²_F + (β/2τ)‖A_op‖²_F − (1/2β)‖Δ_{T+1}‖²_F − (β/2τ)‖A_op − Â_{T+1}‖²_F
    ≤ Φ²T/(2β) + βD²/(2τ),

where the equality uses Δ₁ = 0_{m×n} and Â₁ = 0_{m×k}. Since

  Γ_t^op = P_t − A_opX_t = P_t − A_op(X̂_t + E_t) = Γ̂_t^op − A_opE_t,

we then have ‖Γ̂_t^op‖₁ ≥ ‖Γ_t^op‖₁ − ‖A_opE_t‖₁. The regret is bounded as follows:

  R(T) = Σ_{t=1}^T (‖Γ̄_t‖₁ − ‖Γ_t^op‖₁) ≤ Φ²T/(2β) + βD²/(2τ) + Σ_{t=1}^T ‖A_opE_t‖₁.

Setting β = (Φ/D)√(τT) yields the desired bound.

As mentioned in Section 4, OIADMM can violate the equality constraint at each t, i.e., possibly P_t − Â_{t+1}X̂_t ≠ Γ_{t+1}. However, we show in Theorem B.6 that the accumulated loss caused by the violation of the equality constraint is sublinear in T, i.e., the equality constraint is satisfied on average in the long run.

Theorem B.6. Let {Γ_t, Â_t, Δ_t} be the sequences generated by the OIADMM procedure. Assume the following conditions hold: (i) the Frobenius norm of ∂‖Γ_t‖₁ is upper bounded by Φ, (ii) Â₁ = 0_{m×k} and ‖A_op‖_F ≤ D, (iii) Δ₁ = 0_{m×n}, (iv) 1/τ ≥ 2Ψ_max(X̂_t) for all t, and (v) ‖Γ̂_t^op‖₁ ≤ Υ for all t. Setting β = (Φ/D)√(τT), we have

  Σ_{t=1}^T ‖Γ_{t+1} − Γ̃_t‖²_F ≤ 2D²/τ + (4ΥD/Φ)√(T/τ).

Proof. Let us look at ‖Γ_{t+1} − Γ̃_t‖²_F:

  ‖Γ_{t+1} − Γ̃_t‖²_F = ‖(Γ_{t+1} − Γ̄_t) + (Γ̄_t − Γ̃_t)‖²_F ≤ 2‖Γ_{t+1} − Γ̄_t‖²_F + 2‖Γ̄_t − Γ̃_t‖²_F ≤ 2‖Γ_{t+1} − Γ̄_t‖²_F + 2Ψ_max(X̂_t)‖Â_{t+1} − Â_t‖²_F.    (16)

For the first inequality, we used the simple fact that for any two matrices M₁ and M₂, ‖M₁ + M₂‖²_F ≤ 2‖M₁‖²_F + 2‖M₂‖²_F. The second inequality is because of (9). Now, since ‖Γ_{t+1}‖₁ ≥ 0, we have ‖Γ_{t+1}‖₁ − ‖Γ̂_t^op‖₁ ≥ −‖Γ̂_t^op‖₁ ≥ −Υ. Using this and rearranging the terms in the inequality of Lemma B.3 with A_op instead of A gives

  ‖Γ_{t+1} − Γ̄_t‖²_F ≤ (1/β²)(‖Δ_t‖²_F − ‖Δ_{t+1}‖²_F) + (1/τ)(‖A_op − Â_t‖²_F − ‖A_op − Â_{t+1}‖²_F) − (1/τ − Ψ_max(X̂_t))‖Â_{t+1} − Â_t‖²_F + 2Υ/β.

Plugging this into (16) yields

  ‖Γ_{t+1} − Γ̃_t‖²_F ≤ (2/β²)(‖Δ_t‖²_F − ‖Δ_{t+1}‖²_F) + (2/τ)(‖A_op − Â_t‖²_F − ‖A_op − Â_{t+1}‖²_F) − 2(1/τ − 2Ψ_max(X̂_t))‖Â_{t+1} − Â_t‖²_F + 4Υ/β.

Letting 1/τ ≥ 2Ψ_max(X̂_t) and summing over t from 1 to T, we have

  Σ_{t=1}^T ‖Γ_{t+1} − Γ̃_t‖²_F ≤ (2/β²)(‖Δ₁‖²_F − ‖Δ_{T+1}‖²_F) + (2/τ)(‖A_op − Â₁‖²_F − ‖A_op − Â_{T+1}‖²_F) + 4ΥT/β ≤ 2D²/τ + 4ΥT/β.

Setting β = (Φ/D)√(τT) yields the desired bound.
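To summarize the procedure analyzed in this section, here is a minimal sketch of one OIADMM round implementing (2)-(4). The Γ-update has a soft-thresholding closed form (using soft from Appendix A.1), and the Â-update is a projected gradient step; the projection onto 𝒜 is approximated here by clipping and column rescaling, an illustrative stand-in rather than the exact Π_𝒜:

```python
import numpy as np

def oiadmm_round(P_t, X_hat, A_hat, Delta, beta):
    """One OIADMM round, eqs. (2)-(4); returns (Gamma, A_new, Delta_new)."""
    tau = 1.0 / np.linalg.eigvalsh(X_hat @ X_hat.T).max()  # enforce 1/tau >= Psi_max(X_hat)
    Gamma_bar = P_t - A_hat @ X_hat
    # (2): argmin_G ||G||_1 + <Delta, Gamma_bar - G> + (beta/2)||Gamma_bar - G||_F^2
    Gamma = soft(Gamma_bar + Delta / beta, 1.0 / beta)
    # (3): projected gradient step with G_{t+1} = -(Delta/beta + Gamma_bar - Gamma) X_hat^T
    G = -(Delta / beta + Gamma_bar - Gamma) @ X_hat.T
    A_new = np.maximum(A_hat - tau * G, 0.0)   # clip to nonnegativity
    col = A_new.sum(axis=0)                    # approximate Pi_A: rescale columns
    A_new[:, col > 1.0] /= col[col > 1.0]      # so that each column has l1-norm <= 1
    # (4): dual update
    Delta_new = Delta + beta * (P_t - A_new @ X_hat - Gamma)
    return Gamma, A_new, Delta_new
```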
C Pseudo-Codes from Section 5

Let us start by extending the definition of 𝒜: define

  𝒜_k = {A ∈ R^{m×k} : A ≥ 0_{m×k} and ‖A_j‖₁ ≤ 1 for all j = 1, . . . , k},

where A_j is the jth column in A. We use Π_{𝒜_k} to denote the projection onto the nearest point in the convex set 𝒜_k.

Algorithm 4 : BATCH-IMPL
  Input: P_{[t−1]} ∈ R^{m×N_{t−1}}, X_{[t−1]} ∈ R^{k_t×N_{t−1}}, P_t = [p₁, . . . , p_n] ∈ R^{m×n}, A_t ∈ R^{m×k_t}, λ, ζ, η ≥ 0
  Novel Document Detection Step:
  for j = 1 to n do
    Solve: x_j* = argmin_{x ≥ 0} ‖p_j − A_tx‖₁ + λ‖x‖₁ (solved using Algorithm 2)
    if ‖p_j − A_tx_j*‖₁ + λ‖x_j*‖₁ > ζ, mark p_j as novel
  Batch Dictionary Learning Step:
  Set k_{t+1} ← k_t + η
  Set Z_{[t]} ← [X_{[t−1]}, x₁*, . . . , x_n*]
  Set X_{[t]} ← [Z_{[t]} ; 0_{η×N_t}]
  Set P_{[t]} ← [P_{[t−1]}, p₁, . . . , p_n]
  for i = 1 to convergence do
    Solve: A_{t+1} = argmin_{A ∈ 𝒜_{k_{t+1}}} ‖P_{[t]} − AX_{[t]}‖₁ (solved using Algorithm 3 with warm starts)
    Solve: X_{[t]} = argmin_{X ≥ 0} ‖P_{[t]} − A_{t+1}X‖₁ + λ‖X‖₁ (solved using Algorithm 2)

Algorithm 5 : L2-BATCH
  Input: P_{[t−1]} ∈ R^{m×N_{t−1}}, P_t = [p₁, . . . , p_n] ∈ R^{m×n}, A_t ∈ R^{m×k_t}, λ ≥ 0, ζ ≥ 0, η ≥ 0
  Novel Document Detection Step:
  for j = 1 to n do
    Solve: x_j* = argmin_{x ≥ 0} ‖p_j − A_tx‖₂² + λ‖x‖₁ (solved using the LARS method [3])
    if ‖p_j − A_tx_j*‖₂² + λ‖x_j*‖₁ > ζ, mark p_j as novel
  ℓ2-Batch Dictionary Learning Step:
  Set k_{t+1} ← k_t + η
  Set P_{[t]} ← [P_{[t−1]}, p₁, . . . , p_n]
  [A_{t+1}, X_{[t]}] = argmin_{A ∈ 𝒜_{k_{t+1}}, X ≥ 0} ‖P_{[t]} − AX‖²_F + λ‖X‖₁ (non-negative sparse coding problem)
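The novel document detection step shared by Algorithms 4 and 5 reduces to scoring each incoming document by its sparse coding objective and thresholding at ζ. A minimal sketch for the ℓ1 variant (the solver is passed in as a callable, e.g., a wrapper around the admm_update_X sketch above applied to a single column):

```python
import numpy as np

def detect_novel(P_t, A_t, lam, zeta, solve_x):
    """Flag documents whose best sparse-coding objective under the current
    dictionary A_t exceeds zeta. `solve_x(p, A, lam)` is assumed to return
    argmin_{x >= 0} ||p - A x||_1 + lam * ||x||_1 for a single column p."""
    novel = []
    for j in range(P_t.shape[1]):
        p = P_t[:, [j]]                       # document j as an (m, 1) column
        x = solve_x(p, A_t, lam)              # its sparse code
        score = np.abs(p - A_t @ x).sum() + lam * np.abs(x).sum()
        if score > zeta:
            novel.append(j)                   # poorly represented: mark as novel
    return novel
```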
D Additional Experimental Evaluation

In Figure 2, we show the effect of the size of the dictionary on the performance of Algorithm ONLINE. The average AUC is computed as in Table 1. Not surprisingly, as the size k of the dictionary increases, the average AUC also increases, but correspondingly the running time of the algorithm also increases. The plot suggests that there is a diminishing return on AUC with increase in the size of the dictionary, and this increase in AUC comes at the cost of higher running times.

Post-processing done to Generate Table 2. In each timestep, instead of thresholding by ζ, we take the top 10% of tweets measured in terms of the sparse coding objective value and run a dictionary-based clustering, described in [4], on them. Further post-processing is done to discard clusters without much support and to pick a representative tweet for each cluster.

[Figure 2 appears here: a plot titled "AUC vs. Time for Different Online Dictionary Sizes"; y-axis: average AUC (roughly 0.6 to 0.9), x-axis: CPU running time in seconds.]

Figure 2: TDT dataset: Average AUC vs. running time for different values of the dictionary size k in Algorithm ONLINE. The points plotted from left to right are for k = 50, 100, 150, and 200.

References

[1] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers. Foundations and Trends in Machine Learning, 2011.

[2] P. Combettes and J. Pesquet. Proximal Splitting Methods in Signal Processing. arXiv:0912.3522, 2009.

[3] J. Friedman, T. Hastie, H. Höfling, and R. Tibshirani. Pathwise Coordinate Optimization. The Annals of Applied Statistics, 1, 2007.

[4] S. P. Kasiviswanathan, P. Melville, A. Banerjee, and V. Sindhwani. Emerging Topic Detection using Dictionary Learning. In CIKM, pages 745-754, 2011.

[5] R. T. Rockafellar and R. J.-B. Wets. Variational Analysis. Springer-Verlag, 2004.

[6] J. Yang and Y. Zhang. Alternating Direction Algorithms for ℓ1-Problems in Compressive Sensing. SIAM Journal on Scientific Computing, 33(1):250-278, 2011.