Online Bayesian Passive-Aggressive Learning
International Conference on Machine Learning, 2014
Tianlin Shi and Jun Zhu, Tsinghua University, China
Presented by: Kyle Ulrich, 21 August 2015
Introduction
Online Passive-Aggressive (PA) learning (Crammer et al., 2006) provides online large-margin discriminative methods.
However, PA yields only a point estimate, ruling out rich Bayesian structures (Hjort, 2010; Teh et al., 2006; Ghahramani & Griffiths, 2005).
Maximum entropy discrimination (MED) (Jaakkola et al., 1999) conjoins max-margin learning and Bayesian generative models.
Online Bayesian Passive-Aggressive (BayesPA) learning is proposed as an online learning method for Bayesian max-margin models.
Online Passive-Aggressive Learning
At time $t$, observe data $x_t$ and response $y_t \in \{-1, +1\}$.
Consider a binary classifier parameterized by $w$.
The hinge loss: $\ell_\epsilon(w; x_t, y_t) = (\epsilon - y_t w^\top x_t)_+$
The PA update rule (Crammer et al., 2006)
$$w_{t+1} = \arg\min_w \tfrac{1}{2}\|w - w_t\|^2 \quad \text{s.t.} \quad \ell_\epsilon(w; x_t, y_t) = 0 \qquad (1)$$
has two behaviors:
1. Passively assign $w_{t+1} = w_t$ if $\ell_\epsilon(w_t; x_t, y_t) = 0$, or
2. Aggressively project $w_t$ into the feasible zone of zero loss.
Online Passive-Aggressive Learning
The alternative (soft-margin) Lagrangian form is
$$w_{t+1} = \arg\min_w \tfrac{1}{2}\|w - w_t\|^2 + 2c\,\ell_\epsilon(w; x_t, y_t) \qquad (2)$$
Simple update rules may be derived:
$$\ell_\epsilon^t = (\epsilon - y_t w_t^\top x_t)_+, \qquad c_t = \frac{\ell_\epsilon^t}{\|x_t\|_2^2}, \qquad w_{t+1} = w_t + c_t y_t x_t$$
Allowing for soft-margin constraints provides flexibility for inseparable training examples.
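A minimal sketch of this PA step in Python (notation only; capping the step size at $2c$ for the soft-margin case is our reading of objective (2), in the style of PA-I from Crammer et al., 2006):

```python
import numpy as np

def pa_update(w, x, y, eps=1.0, c=None):
    """One online Passive-Aggressive step for a binary linear classifier.

    w   : current weight vector w_t
    x   : feature vector x_t
    y   : label y_t in {-1, +1}
    eps : margin parameter of the hinge loss
    c   : optional soft-margin parameter; caps the step size (assumed PA-I form)
    """
    loss = max(0.0, eps - y * np.dot(w, x))   # hinge loss l_eps(w_t; x_t, y_t)
    if loss == 0.0:
        return w                              # passive: constraint already satisfied
    step = loss / np.dot(x, x)                # aggressive: project onto the zero-loss set
    if c is not None:
        step = min(step, 2.0 * c)             # soft-margin cap (assumption, PA-I style)
    return w + step * y * x
```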
Maximum Entropy Discrimination
MED is designed for discriminative tasks such as classification or anomaly detection (Jaakkola et al., 1999).
Here $w$ is a random variable with prior $p(w)$.
A distribution $q^*(w)$ is found from
$$\min_{q(w)} \mathrm{KL}[q(w)\,\|\,p(w)] \quad \text{s.t.} \quad \mathbb{E}_{q(w)}[\ell_\epsilon(w; x_t, y_t)] \le \zeta, \ \forall t \qquad (3)$$
Support Vector Machines belong to this class of models.
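A brief sketch of that last claim, following the original MED analysis (Jaakkola et al., 1999) for the classic margin-form constraints $\mathbb{E}_{q(w)}[y_t w^\top x_t] \ge \epsilon - \zeta$ and a Gaussian prior $p(w) = \mathcal{N}(0, I)$: the solution takes the exponential-family form
$$q^*(w) \propto p(w)\,\exp\Big(\sum_t \tau_t\, y_t\, w^\top x_t\Big),$$
a Gaussian with mean $\mathbb{E}_{q^*}[w] = \sum_t \tau_t y_t x_t$, so the Lagrange multipliers $\tau_t$ play the role of SVM dual variables.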
Relationship to Bayesian Inference
MED resembles Bayesian inference cast as an optimization problem.
Bayes' rule provides the solution to
$$\min_{q(w)} \mathrm{KL}[q(w)\,\|\,p(w)] - \mathbb{E}_{q(w)}[\log p(x \mid w)] \qquad (4)$$
and mean-field variational inference adds the constraint $q(w) = \prod_i q(w_i)$.
Therefore, MED introduces a pseudo-likelihood as a discriminative model in place of the likelihood in the Bayesian generative model.
These pseudo-likelihood functions may not have a probabilistic interpretation.
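A quick numerical sanity check of this view, as a sketch on a discrete grid (the grid, prior, and likelihood are illustrative choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.linspace(-3.0, 3.0, 201)                      # discrete grid standing in for w
prior = np.exp(-0.5 * w**2); prior /= prior.sum()    # p(w): standard normal on the grid
loglik = -0.5 * (1.5 - w)**2                         # log p(x|w) for one observation x = 1.5

def objective(q):
    """KL[q || p(w)] - E_q[log p(x|w)], the functional minimized in Eq. (4)."""
    return np.sum(q * np.log(q / prior)) - np.sum(q * loglik)

bayes = prior * np.exp(loglik); bayes /= bayes.sum() # posterior from Bayes' rule
other = rng.dirichlet(np.ones_like(w))               # an arbitrary competing distribution
assert objective(bayes) <= objective(other)          # Bayes' rule attains the minimum
```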
Online Bayesian Passive-Aggressive Learning
Alternatively, consider sequentially updating a new posterior distribution $q_{t+1}(w)$:
$$\min_{q(w)\in\mathcal{F}_t} \mathrm{KL}[q(w)\,\|\,q_t(w)] - \mathbb{E}_{q(w)}[\log p(x_t \mid w)] \quad \text{s.t.} \quad \ell_\epsilon(q(w); x_t, y_t) = 0 \qquad (5)$$
This update has two behaviors:
1. Passively pass the posterior, $q_{t+1}(w) \propto q_t(w)\,p(x_t \mid w)$, under no loss, $\ell_\epsilon(q(w); x_t, y_t) = 0$.
2. Aggressively project the posterior into a feasible zone of zero loss.
When no likelihood is defined ($p(x_t \mid w)$ is independent of $w$), passively pass $q_{t+1}(w) = q_t(w)$.
Online Bayesian Passive-Aggressive Learning
The Lagrangian of the optimization problem:
$$q_{t+1}(w) = \arg\min_{q(w)\in\mathcal{F}_t} \mathcal{L}(q(w)) + 2c\,\ell_\epsilon(q(w); x_t, y_t) \qquad (6)$$
For max-margin classifiers, consider two loss functions:
Averaging classifier: $\ell_\epsilon^{\mathrm{Avg}}(q(w); x_t, y_t) = (\epsilon - y_t \mathbb{E}_{q(w)}[w^\top x_t])_+ \qquad (7)$
Gibbs classifier: $\ell_\epsilon^{\mathrm{Gibbs}}(q(w); x_t, y_t) = \mathbb{E}_{q(w)}[(\epsilon - y_t w^\top x_t)_+] \qquad (8)$
The expected hinge loss is an upper bound of the average loss: $\ell_\epsilon^{\mathrm{Gibbs}} \ge \ell_\epsilon^{\mathrm{Avg}}$
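The bound is just Jensen's inequality applied to the convex hinge; a Monte Carlo sketch with an illustrative Gaussian $q(w)$:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(100_000, 3))                 # samples w ~ q(w) = N(0, I), illustrative
x, y, eps = np.array([0.5, -1.0, 2.0]), 1.0, 1.0  # one toy example (x_t, y_t)

margin = y * (W @ x)
gibbs_loss = np.maximum(0.0, eps - margin).mean() # Eq. (8): E_q[(eps - y w^T x)_+]
avg_loss = max(0.0, eps - margin.mean())          # Eq. (7): (eps - y E_q[w^T x])_+
assert gibbs_loss >= avg_loss                     # Jensen: expected hinge bounds hinge of mean
```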
Update Rule
From Bayes' rule with the Gibbs classifier, the solution is
$$q_{t+1}(w) \propto \underbrace{q_t(w)}_{\text{prior}}\ \underbrace{p(x_t \mid w)}_{\text{likelihood}}\ \underbrace{\exp\!\big(-2c\,(\epsilon - y_t w^\top x_t)_+\big)}_{\text{pseudo-likelihood}} \qquad (9)$$
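A minimal sketch of this multiplicative update with no likelihood term, representing $q_t(w)$ by weighted particles (the particle representation is our illustration, not the paper's inference scheme):

```python
import numpy as np

def bayespa_gibbs_update(particles, weights, x, y, eps=1.0, c=1.0):
    """Reweight particles representing q_t(w) by the pseudo-likelihood in Eq. (9).

    particles : (n, d) array of samples w ~ q_t(w)
    weights   : (n,) normalized particle weights
    x, y      : observed feature vector and label in {-1, +1}
    """
    hinge = np.maximum(0.0, eps - y * (particles @ x))  # (eps - y w^T x)_+ per particle
    new_weights = weights * np.exp(-2.0 * c * hinge)    # multiply in the pseudo-likelihood
    return new_weights / new_weights.sum()              # renormalize to get q_{t+1}(w)
```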
Mini-Batches
Observe a mini-batch of data points at time $t$:
Data $X_t = \{x_d\}_{d \in B_t}$, labels $Y_t = \{y_d\}_{d \in B_t}$.
The per-example losses are simply summed:
$$q_{t+1}(w) = \arg\min_{q\in\mathcal{F}_t} \mathcal{L}(q(w)) + 2c\,\ell_\epsilon(q(w); X_t, Y_t), \quad \text{where } \ell_\epsilon(q(w); X_t, Y_t) = \sum_{d\in B_t} \ell_\epsilon(q(w); x_d, y_d) \qquad (10)$$
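Extending the particle sketch above to a mini-batch: the summed losses become a product of pseudo-likelihood factors over $d \in B_t$ (same hypothetical particle representation as before):

```python
import numpy as np

def bayespa_gibbs_update_batch(particles, weights, X, Y, eps=1.0, c=1.0):
    """Mini-batch particle reweighting; X is (m, d) features, Y is (m,) labels."""
    hinge = np.maximum(0.0, eps - Y[None, :] * (particles @ X.T))  # (n, m) hinge losses
    new_weights = weights * np.exp(-2.0 * c * hinge.sum(axis=1))   # product over the batch
    return new_weights / new_weights.sum()
```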
Latent Structures
Bayesian models are extensively developed with global-local structures:
Global variables $M$ share properties across the dataset.
Local variables $H_t = \{h_d\}_{d\in B_t}$ characterize each $X_t$.
In general, the posterior $q_{t+1}(w, M, H_t)$ is the solution to
$$\min_{q\in\mathcal{F}_t} \mathcal{L}(q(w, M, H_t)) + 2c\,\ell_\epsilon(q(w, M, H_t); X_t, Y_t) \qquad (11)$$
where
$$\mathcal{L}(q) = \mathrm{KL}[q\,\|\,q_t(w, M)\,p_0(H_t)] - \mathbb{E}_q[\log p(X_t \mid w, M, H_t)]$$
To decouple global and local structures in the posterior, make mean-field assumptions:
$$q(w, M, H_t) = q(w)\,q(M)\,q(H_t), \qquad q_{t+1}(w, M) = q^*(w)\,q^*(M)$$
Latent Dirichlet Allocation Model
Consider the LDA model (Blei et al., 2003):
$$\theta_d \sim \mathrm{Dir}(\alpha), \quad \phi_k \sim \mathrm{Dir}(\gamma), \quad z_{di} \sim \mathrm{Mult}(\theta_d), \quad x_{di} \sim \mathrm{Mult}(\phi_{z_{di}})$$
$x_{di}$ is the $i$-th word in document $d$; $z_{di}$ is the topic assigned to this word.
$\theta_d$ is a distribution over topics for document $d$; $\phi_k$ is a distribution over the $W$-word vocabulary for topic $k$.
Topic assignments should be predictive of class labels $y_d$.
The online BayesPA objective for MedLDA (Zhu et al., 2012):
$$\min_q \mathrm{KL}[q(w, \Phi, Z_t)\,\|\,q_t(w, \Phi)\,p_0(Z_t)\,p(X_t \mid \Phi, Z_t)] + 2c \sum_{d\in B_t} \ell_\epsilon(q(w); z_d, y_d) \qquad (12)$$
An efficient algorithm for online MedLDA is derived; a generative sketch follows below.
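A minimal generative sketch of the LDA model above (dimensions and hyperparameters are illustrative, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(0)
K, W, D, N = 5, 100, 10, 50         # topics, vocabulary size, documents, words per doc
alpha, gamma = 0.1, 0.01            # Dirichlet hyperparameters (illustrative)

phi = rng.dirichlet(np.full(W, gamma), size=K)          # phi_k ~ Dir(gamma)
docs = []
for d in range(D):
    theta = rng.dirichlet(np.full(K, alpha))            # theta_d ~ Dir(alpha)
    z = rng.choice(K, size=N, p=theta)                  # z_di ~ Mult(theta_d)
    x = np.array([rng.choice(W, p=phi[k]) for k in z])  # x_di ~ Mult(phi_{z_di})
    docs.append(x)
```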
Latent Dirichlet Allocation Model (Some Details)
The pseudo-likelihood is
$$\psi(y_d \mid z_d, w) = \exp(-2c\,(\zeta_d)_+), \qquad \zeta_d = \epsilon - y_d f(w, z_d)$$
A data augmentation scheme is used (Zhu et al., 2013), where
$$\psi(y_d \mid z_d, w) = \int_0^\infty \psi(y_d, \lambda_d \mid z_d, w)\,d\lambda_d, \qquad \psi(y_d, \lambda_d \mid z_d, w) = (2\pi\lambda_d)^{-1/2}\exp\!\left(-\frac{(\lambda_d + c\,\zeta_d)^2}{2\lambda_d}\right)$$
The following optimization problem may now be solved with iterative updates:
$$\min_{q\in\mathcal{P}} \mathcal{L}(q(w, \Phi, Z_t, \lambda_t)) - \mathbb{E}_q[\log \psi(Y_t, \lambda_t \mid Z_t, w)] \qquad (13)$$
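A quick numerical check of the augmentation identity (a sketch; `quad` stands in for the exact calculus, and the values of $c$ and $\zeta_d$ are arbitrary):

```python
import numpy as np
from scipy.integrate import quad

c, zeta = 1.0, 0.7                                        # illustrative values of c, zeta_d

def augmented(lam):
    """psi(y_d, lambda_d | z_d, w) as a function of the augmented variable lambda_d."""
    return (2.0 * np.pi * lam) ** -0.5 * np.exp(-(lam + c * zeta) ** 2 / (2.0 * lam))

integral, _ = quad(augmented, 0.0, np.inf)
print(integral, np.exp(-2.0 * c * max(zeta, 0.0)))        # both equal psi(y_d | z_d, w)
```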
Hierarchical Dirichlet Process
The authors extend to a nonparametric latent variable model.
The HDP has an alternative stick-breaking construction of the topic mixing proportions (Teh et al., 2006; Wang & Blei, 2012):
$$\theta_d \sim \mathrm{Dir}(\alpha\pi), \qquad \pi_k = \pi'_k \prod_{i<k}(1 - \pi'_i), \qquad \pi'_k \sim \mathrm{Beta}(1, \gamma) \qquad (14)$$
The authors propose a solution to
$$\min_{q\in\mathcal{P}} \mathcal{L}(q(w, \pi, \Phi, H_t)) - \mathbb{E}_q[\log \psi(Y_t, \lambda_t \mid Z_t, w)] \qquad (15)$$
Details may be found in the paper.
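A sketch of the truncated stick-breaking draw in Eq. (14) (truncation level and hyperparameters are illustrative; the paper's variational treatment differs):

```python
import numpy as np

def stick_breaking(gamma, K, rng):
    """Truncated stick-breaking construction of the top-level weights pi in Eq. (14)."""
    v = rng.beta(1.0, gamma, size=K)                      # pi'_k ~ Beta(1, gamma)
    stick_left = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    return v * stick_left                                 # pi_k = pi'_k prod_{i<k}(1 - pi'_i)

rng = np.random.default_rng(0)
alpha = 0.5                                               # illustrative concentration
pi = stick_breaking(gamma=2.0, K=50, rng=rng)             # truncation at K = 50
theta_d = rng.dirichlet(alpha * pi + 1e-6)                # theta_d ~ Dir(alpha * pi); small
                                                          # floor avoids zero parameters
```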
Experiment
Multi-class classification on the 20 Newsgroups dataset: 11,269 training documents; 7,505 test documents.
One-vs-all strategy for multi-class classification.
Comparisons:
Passive-aggressive algorithms: pamedlda and pamedhdp
Batch counterparts: bmedlda and bmedhdp
Sparse inference for LDA: splda
Truncation-free online variational HDP: tfhdp
MED parameters: $\epsilon = 164$, $c = 1$
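A sketch of the one-vs-all reduction (hypothetical wiring around a plain PA learner; the experiments themselves train pamedlda/pamedhdp, not this baseline):

```python
import numpy as np

def one_vs_all_pa_train(X, labels, n_classes, eps=1.0):
    """Train one binary PA classifier per class (one-vs-all)."""
    W = np.zeros((n_classes, X.shape[1]))
    for x, label in zip(X, labels):
        for k in range(n_classes):
            y = 1.0 if label == k else -1.0            # class k vs. the rest
            loss = max(0.0, eps - y * np.dot(W[k], x)) # hinge loss for classifier k
            if loss > 0.0:
                W[k] += (loss / np.dot(x, x)) * y * x  # PA projection step
    return W

def one_vs_all_predict(W, X):
    return np.argmax(X @ W.T, axis=1)                  # largest margin wins
```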
Error vs. Passes Through Data
Accuracy and Training Time
Error vs. Training Time