Online Bayesian Passive-Aggressive Learning
Tianlin Shi (stl501@gmail.com)
Jun Zhu (dcszj@mail.tsinghua.edu.cn)
The BIG DATA Challenge
- Large amounts of data: Big Science generates 25 PB of data annually.
- Streaming data.
- Complex data: text, images, genomics, etc.
(Image courtesy: http://robotic-rodents.com/)
Online Learning" n Batch Learning" Data Learning Algorithm" Loss "L Model 1. Data may come in as a stream. 2. We don t have memory/4me to compute it! There is redundancy in data.
Online Learning" n Online Learning" Data instantaneous Loss Online" Learning Algorithm" Loss "L Predic4on (Supervised Case) Model Update
Online Passive-Aggressive Learning (Crammer et al. 2006 [1])
- Online update of a large-margin classifier: learn the SVM weight $w$ from sequential data $\{x_t\}_{t\ge1}$ and labels $\{y_t\}_{t\ge1}$.
- Closed-form update rule.
- Drawbacks:
  1. Limited model complexity.
  2. A single point estimate of the model.
Bayesian models" Flexibility! Can be non-parametric! e.g. Infinite number of components in a topic model.! " " " " " " " "Teh et al. HDP. JASA 2006.! Posterior inference is challenging!! Both VB and MCMC can be expensive in big data.! Attempts to speed up the inference:! Online LDA. Hoffman et al. NIPS 2010! Online Sparse Stochastic Inference. Mimno et al. ICML 2012! Stochastic Gradient Fisher Score. Ahn et al. ICML 2012.! Typically are lack of discriminative ability.!
Max-Margin Bayesian Models
- MED: Maximum Entropy Discrimination (Jaakkola et al., 1999).
- MED with latent variables: MedLDA (Zhu et al., JMLR 2012).
- MED with nonparametric Bayesian inference: M3F, Max-Margin Matrix Factorization (Xu et al., NIPS 2012).
- Posterior inference remains a big challenge!
Online Bayesian Passive-Aggressive Learning (BayesPA)
Outline" General formulation! Online max-margin topic models! Experiment! Future work!
Outline" General formulation Online max-margin topic models! Experiment! Future work!
Online PA Algorithms" Update weight: w t+1 = min w w w t 2 s.t.:l ε (w; x t, y t ) = 0. Case I (Passive Update): l ε = 0 Online BayesPA Learning" Update distribu4on of weight: q t+1 (w) = min KL[q(w) q t (w)] E q [log p(x t w)] q F t s.t.:l ε (q(w); x t, y t ) = 0. Case I (Passive Update): l ε = 0 w t+1 = w t q t (w) q t+1 (w) feasible zone. Case II (Aggressive Update): l ε > 0 feasible zone. Case II (Aggressive Update): l ε = 0 q t+1 (w) w t feasible zone. q t (w) feasible zone.
Online PA Algorithms" Online BayesPA Learning" Soft-margin Constraints:" w t+1 = q t+1 (w) = Soft-margin Constraints:" min w w w t 2 + 2l ε (w; x t, y t ). Loss l ε " Hinge loss for classification." max(0,ε y t w x t ) Epsilon-insensitive loss for regression." max(0, w x t y t ε) Close-form update rule:" w t+1 = w t + y t τ t x t, min KL[q(w) q t (w)p(x t w)] q F t +2l ε (q(w); x t, y t ). We focus on classifiers for now." Averaging classifiers" Gibbs Classifiers, draw sample w ~ q(w) " and predict " τ t = min(c, l ε x t 2 ) where" (x) + = max(0, x)
Lemma 1. The expected hinge loss $\ell_\epsilon^{\mathrm{Gibbs}}$ is an upper bound of the hinge loss $\ell_\epsilon^{\mathrm{Bayes}}$: $\ell_\epsilon^{\mathrm{Bayes}} \le \ell_\epsilon^{\mathrm{Gibbs}}$.
Proof. Straightforward: convexity of $(x)_+$.
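Spelled out, the proof is one application of Jensen's inequality to the convex function $(x)_+$:
$$\ell_\epsilon^{\mathrm{Bayes}} = \big(\epsilon - y\,\mathbb{E}_q[f]\big)_+ = \big(\mathbb{E}_q[\epsilon - yf]\big)_+ \le \mathbb{E}_q\big[(\epsilon - yf)_+\big] = \ell_\epsilon^{\mathrm{Gibbs}}.$$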
Lemma 2. If $q_0(w) = \mathcal{N}(0, I)$, $\mathcal{F}_t = \mathcal{P}$, and we use the averaging classifier, then non-likelihood BayesPA subsumes online PA.
Non-likelihood BayesPA (the likelihood term is dropped):
$$\min_{q\in\mathcal{F}_t} \mathrm{KL}[q(w)\,\|\,q_t(w)] \quad \text{s.t. } \ell_\epsilon(q(w); x_t, y_t) = 0.$$
Proof Sketch. With soft-margin constraints, the update solves
$$\min_{q(w)\in\mathcal{P}} \mathrm{KL}[q(w)\,\|\,q_t(w)] + 2c\max(0,\ \epsilon - y_t\,\mathbb{E}_q[w^\top x_t]).$$
Conjugacy (Zhu et al. 2012, RegBayes): for a feature function $\psi$ and convex function $g$,
$$\min_{q(M)\in\mathcal{P}} \mathrm{KL}[q(M)\,\|\,p(M,D)] + g(\mathbb{E}_q[\psi(M)]) = \max_{\phi}\ -\log\!\int p(M,D)\,e^{\langle\phi,\psi(M)\rangle}\,dM - g^*(-\phi),$$
where the optimal solution is
$$q^*(M) \propto p(M,D)\exp(\langle\phi^*,\psi(M)\rangle).$$
Applying the conjugacy result,
$$\min_{q(w)\in\mathcal{P}} \mathrm{KL}[q(w)\,\|\,q_t(w)] + 2c\max(0,\ \epsilon - y_t\,\mathbb{E}_q[w^\top x_t]) = \max_{0\le\tau\le c}\ -\log\Gamma(\tau),$$
where
$$q^*(w) = \frac{1}{\Gamma(\tau)}\,q_t(w)\exp\big(\tau(y_t w^\top x_t - \epsilon)\big).$$
Use induction: assume $q_t(w) = \mathcal{N}(w;\mu_t,\sigma^2 I)$, with initial case $q_0(w) = \mathcal{N}(w;0,\sigma^2 I)$. Then
$$q_{t+1}(w) \propto \exp\Big(-\frac{\|w-\mu_t\|^2}{2\sigma^2} + \tau(y_t w^\top x_t - \epsilon)\Big).$$
Dual form:
$$\min_{\tau}\ -\tau y_t\,\mu_t^\top x_t + \frac{\tau^2\sigma^2}{2}\,x_t^\top x_t + \epsilon\tau.$$
Primal form:
$$\min_{\mu}\ \frac{\|\mu-\mu_t\|^2}{2\sigma^2} + c\max(0,\ \epsilon - y_t\,\mu^\top x_t),$$
which is exactly the online PA objective.
Lemma 3. If $\mathcal{F}_t = \mathcal{P}$ and we use the Gibbs classifier, the update rule of BayesPA is
$$q_{t+1}(w) \propto \underbrace{q_t(w)}_{\text{prior}}\ \underbrace{p(x_t|w)}_{\text{likelihood}}\ \underbrace{e^{-2c(\epsilon - y_t w^\top x_t)_+}}_{\text{pseudo-likelihood}}.$$
Extension: Learning with Mini-Batches
At time t, we receive an incoming mini-batch $B_t$ and solve
$$\min_{q\in\mathcal{F}_t} \mathrm{KL}[q(w)\,\|\,q_t(w)\,p(X_t|w)] + 2\ell_\epsilon(q(w); X_t, Y_t),$$
where $X_t = \{x_d\}_{d\in B_t}$, $Y_t = \{y_d\}_{d\in B_t}$, and
$$\ell_\epsilon(q(w); X_t, Y_t) = \sum_{d\in B_t}\ell_\epsilon(q(w); x_d, y_d).$$
Extension: Learning with Latent Structures" Data" Classifier weight" w x 5 x 4 x 3 x 2 x 1 h 5 h 4 h 3 h 2 h 1 Latent Structure " H Model M
Extension: Learning with Latent Structures" Uncertainty in H t "" " "à Infer H together with M,w t via BayesPA rule." min q F t KL[q(w,M,H t ) q t (w,m)p(x t w,m,h t )]+ 2l ε (q(w,m,h t ); x t, y t ). " But how can we obtain" q t+1 "(w,m) " "?" Marginalize " q(w,m,h " " t )" " " " " " " " "à Intractable" Mean-Field Assumption! " " q(w,m,h t ) = q(w)q(m)q(h t ) Solve the objective and use " q t+1 (w,m) = q * (w)q * (M) "
Outline" General formulation! Online max-margin topic models" Experiment! Future work!
Batch MedLDA" n Graphical Interpretation" β φ k K α θ d z di x di n d y d D w v 2
The generative process:
- For each topic $k = 1, 2, \dots, K$: $\phi_k \sim \mathrm{Dir}(\beta)$, $\ w_k \sim \mathcal{N}(0, v^2)$.
- For each document $d = 1, 2, \dots, D$: $\theta_d \sim \mathrm{Dir}(\alpha)$;
  for each $i$-th word in document $d$: $z_{di} \sim \mathrm{Multi}(\theta_d)$, $\ x_{di} \sim \mathrm{Multi}(\phi_{z_{di}})$.
- Predict with $f(w, \bar z_d) = w^\top \bar z_d$, where $\bar z_{dk} = \frac{1}{n_d}\sum_i \mathbb{I}[z_{di} = k]$.
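To make the generative story concrete, a small simulation sketch in Python (the sizes $K, V, D, n_d$ and hyperparameter values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, D, n_d = 5, 100, 10, 50          # topics, vocabulary, documents, words/doc
alpha, beta, v2 = 0.5, 0.1, 1.0        # hyperparameters

phi = rng.dirichlet(np.full(V, beta), size=K)   # topics phi_k ~ Dir(beta)
w = rng.normal(0.0, np.sqrt(v2), size=K)        # weights w_k ~ N(0, v^2)

for d in range(D):
    theta = rng.dirichlet(np.full(K, alpha))    # theta_d ~ Dir(alpha)
    z = rng.choice(K, size=n_d, p=theta)        # z_di ~ Multi(theta_d)
    x = np.array([rng.choice(V, p=phi[k]) for k in z])  # x_di ~ Multi(phi_{z_di})
    z_bar = np.bincount(z, minlength=K) / n_d   # empirical topic proportions
    y = np.sign(w @ z_bar)                      # label via f(w, z_bar) = w' z_bar
```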
Batch MedLDA" n Inference of LDA" Let" Φ = {φ k } K k=1,θ = {θ d } D d=1,z = {z d } D D d=1, X = {x d } d=1 LDA infers posterior " Or equivalently solves" p(φ,θ,z X) p 0 (Φ,Θ,Z)p(X Z,Φ) min KL[q(Φ,Θ,Z) p(φ,θ,z X)] q P
Batch MedLDA" n Inference of MedLDA" Inference problem:" min KL[q(Φ,Θ,Z) p(φ,θ,z X)]+ 2 l ε(q(w,z d ); x d, y d )} q P Define a prediction model:" Loss function:" f (w,z d ) = w z d where" z dk = 1 I[z di = k] n d Averaging loss: " l Avg ε (q(w,z d ); x d, y d ) = (ε y d E q [ f (w,z d )]) + Gibbs loss:" l Gibbs ε (q(w,z d ); x d, y d ) = E q [(ε y d f (w,z d )) + ] D d=1 i
Online MedLDA " Recall BayesPA with latent structures.! min q F t KL[q(w,M,H t ) q t (w,m)p(x t w,m,h t )]+ 2l ε (q(w,m,h t ); x t, y t ). In MedLDA, we have M = Φ,H t = (Θ t,z t )! But to reduce parameter space, we collapse out! Pr[Z d α ] = Pr[Z d θ d ]Pr[θ d α ]dθ d = D(α + C d ), d B t θd D(α ) So! M = Φ,H t = Z t Exact inference is hard!! " " "à Mean-field assumption! Θ t q(w,φ,z t ) = q(w)q(φ)q(z t )
Online MedLDA with Gibbs Classifiers
By Lemma 3, the optimal solution has the form
$$q_{t+1}(w,\Phi,Z_t) \propto q_t(w,\Phi)\,p_0(Z_t|\alpha)\,p(X_t|Z_t,\Phi)\,\psi(Y_t|w,Z_t),$$
where
$$\psi(Y_t|w,Z_t) = \prod_{d\in B_t}\psi(y_d|w,z_d), \qquad \psi(y_d|w,z_d) = e^{-2c(\epsilon - y_d w^\top\bar z_d)_+}.$$
This does not look friendly!
Lemma: Scale Mixture (Zhu et al. 2013). The pseudo-likelihood can be expressed as
$$\psi(y_d|w,z_d) = \int_0^\infty \frac{1}{\sqrt{2\pi\lambda_d}}\exp\Big(-\frac{(\lambda_d + c\xi_d)^2}{2\lambda_d}\Big)\,d\lambda_d,$$
where $\xi_d = \epsilon - y_d w^\top\bar z_d$. Let
$$\psi(Y_t,\lambda_t|w,Z_t) = \prod_{d\in B_t}\psi(y_d,\lambda_d|w,z_d), \qquad \psi(y_d,\lambda_d|w,z_d) = \frac{1}{\sqrt{2\pi\lambda_d}}\exp\Big(-\frac{(\lambda_d + c\xi_d)^2}{2\lambda_d}\Big).$$
So our posterior at round t can be expressed as
$$q_{t+1}(w,\Phi,Z_t,\lambda_t) \propto q_t(w,\Phi)\,p_0(Z_t|\alpha)\,p(X_t|Z_t,\Phi)\,\psi(Y_t,\lambda_t|w,Z_t).$$
Again, the mean-field assumption:
$$q(w,\Phi,Z_t,\lambda_t) = q(w)\,q(\Phi)\,q(Z_t,\lambda_t).$$
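The scale-mixture identity is easy to check numerically; a sketch assuming SciPy is available, with illustrative values of $c$ and $\xi_d$:

```python
import numpy as np
from scipy.integrate import quad

def mixture_integral(c, xi):
    """Integrate the scale-mixture density over lambda in (0, inf)."""
    f = lambda lam: np.exp(-(lam + c * xi) ** 2 / (2 * lam)) / np.sqrt(2 * np.pi * lam)
    return quad(f, 0, np.inf)[0]

c = 1.0
for xi in (-1.5, 0.0, 0.7):
    direct = np.exp(-2 * c * max(0.0, xi))   # psi = e^{-2c (xi)_+}
    print(f"xi={xi:+.1f}  direct={direct:.6f}  mixture={mixture_integral(c, xi):.6f}")
```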
Fix q(z t,λ t ) Online Gibbs MedLDA: Global Update" with the mean-field assumption, obviously" If initially" Then we have the update rule"
Online Gibbs MedLDA: Local Update
Fix $q(w,\Phi)$; we have
$$q(Z_t,\lambda_t) \propto p_0(Z_t|\alpha)\prod_{d\in B_t}\frac{1}{\sqrt{2\pi\lambda_d}}\exp\Big(\sum_{i\in[n_d]}\Lambda_{z_{di},x_{di}} - \mathbb{E}_{q(\Phi,w)}\Big[\frac{(\lambda_d + c\xi_d)^2}{2\lambda_d}\Big]\Big),$$
where
$$\Lambda_{k,x} := \mathbb{E}_{q(\phi)}[\log\phi_{k,x}] = \Psi(\Delta_{k,x}^*) - \Psi\Big(\sum_{x'}\Delta_{k,x'}^*\Big).$$
But it is hard to evaluate the expectation using the above formula:
- no closed form;
- $Z_t$ has a huge number of combinations.
→ Gibbs sampling!
Online Gibbs MedLDA: Gibbs Sampling
- For $Z_t$: sample each $z_{di}$ in turn from its conditional distribution.
- For $\lambda_t$: $\lambda_d^{-1}$ follows an inverse Gaussian distribution,
$$\lambda_d^{-1} \sim \mathcal{IG}\Big(\lambda_d^{-1};\ \frac{1}{c\sqrt{\xi_d^2 + \bar z_d^\top\Sigma_*\bar z_d}}\Big).$$
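Sampling $\lambda_d^{-1}$ is cheap with NumPy's Wald (inverse Gaussian) sampler; a sketch, assuming the shape parameter is 1 as in the parameterization of Zhu et al. (2013), and that $\mathbb{E}[\xi_d^2] = \bar\xi_d^2 + \bar z_d^\top\Sigma_*\bar z_d$ has already been computed:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_lambda(xi_sq, c=1.0):
    """Draw lambda_d by sampling lambda_d^{-1} from an inverse Gaussian
    with mean 1/(c*sqrt(E[xi_d^2])) and shape 1 (numpy calls it `wald`)."""
    lam_inv = rng.wald(1.0 / (c * np.sqrt(xi_sq)), 1.0)
    return 1.0 / lam_inv

print(sample_lambda(xi_sq=2.3))
```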
Nonparametric Extension" n MED Hierarchical Dirichlet Process (MedHDP)" Stick Breaking Process" β φ k k = 1,2,..., π k θ d z di x di k = 1,2,..., n d y d D w Gaussian Process"
Nonparametric Extension" n Stick Breaking Process" π 1 π 2 π 3. n Generate Topic Portion" π 1 π 2 π 3. θ d ~ Dir(απ )
MedHDP" Draw topic portion from, the rest is the same as LDA." " Inference " π min q P KL[q(w,π,Φ,Θ,Z) p(w,π,φ,θ,z X)]+ 2c l ε (q(w,z d ); x d, y d )) Loss function is almost the same as LDA, expect for prediction rule." f (w,z d ) = The term is essentially finite." k=1 w k z dk D d=1
! Online Nonparametric MedLDA" Recall BayesPA with latent structures.! min q F t KL[q(w,M,H t ) q t (w,m)p(x t w,m,h t )]+ 2l ε (q(w,m,h t ); x t, y t ). In MedHDP, we! Θ Collapse out.! Introduce an auxiliary latent variable! "We can show! p(s d,z d π ) S(n d z dk,s dk )(απ k ) s dk Exact inference is hard à Mean-field assumption k=1 p(z d π ) = p(s d,z d π ) s d
Online Nonparametric MedLDA: Global Update
For $\Phi, w$: the same update rules as in online MedLDA.
For $\pi$: by the mean-field assumption, if initially $q(\pi_k) = \mathrm{Beta}(u_k^0, v_k^0)$, then by induction $q(\pi_k) = \mathrm{Beta}(u_k^t, v_k^t)$, where the update rule is
$$u_k^* = u_k^t + \sum_{d\in B_t}\mathbb{E}_q[s_{dk}], \qquad v_k^* = v_k^t + \sum_{d\in B_t}\sum_{j>k}\mathbb{E}_q[s_{dj}].$$
Online Nonparametric MedLDA: Local Update
Fix the global distributions over $\Phi, \pi, w$:
$$\tilde q(Z_t, S_t) \propto \exp\big(\mathbb{E}_{q(\Phi)q(\pi)}[\log p(X_t|\Phi,Z_t) + \log p(Z_t,S_t|\pi)]\big),$$
$$\tilde q(Z_t, \lambda_t) \propto \exp\big(\mathbb{E}_q[\log\psi(Y_t,\lambda_t|w,Z_t)]\big).$$
But $\pi$ has an infinite number of components! Solution:
- borrow ideas from Wang & Blei (NIPS 2012): sample $Z_t, S_t, \lambda_t$ together,
- using the direct sampling scheme for HDP (Teh et al., HDP, JASA 2006).
Online Nonparametric MedLDA: Gibbs Sampling
- For $Z_t, \lambda_t$: the same as in online MedLDA.
- For $\pi$: sample $\pi_k \sim \mathrm{Beta}(a_k, b_k)$, where
$$a_k = u_k^* + \sum_{d\in B_t} s_{dk}, \qquad b_k = v_k^* + \sum_{d\in B_t}\sum_{j>k} s_{dj}.$$
Outline" General formulation! Online max-margin topic models! Experiment Future work!
Classification on 20NG
- 20Newsgroups: 20 categories of documents; training/testing split: 11,269/7,505.
- We test online MedLDA (paMedLDA) and online MedHDP (paMedHDP).
- Compared with:
  - their batch counterparts;
  - Gibbs MedLDA (Zhu et al., ICML 2013);
  - topic model + SVM: Sparse Stochastic LDA (Mimno et al., ICML 2012) and truncation-free HDP (Wang & Blei, NIPS 2012).
Sensitivity to Batch Size
Sensitivity to the Number of Iterations and Samples
Multi-Task Classification" Extend our algorithm to multi-task learning.! Label 1 y d 1 x d Label 2 y d 2 Label 2 y d T Simply solve!
Multi-Task Classification" 1.1 M wikipedia dataset.! 20 kinds of label, not necessarily exclusive! Training/Testing split: 1.1 M / 5 K! F1 score:! 2 precision recall precision + recall
Future Work" Theoretical analysis of BayesPA.! Parallel asynchronous BayesPA learning.! BayesPA learning for regression problems.!
Reference"
Thank you."