Online Bayesian Passive-Aggressive Learning

Tianlin Shi, Jun Zhu (Tsinghua University)
ICML 2014
Full journal version: http://qr.net/b1rd

Outline
- Introduction: Motivation, Framework
- Theory: Connection to PA, Regret Analysis
- Practice: Extensions, Latent Space, Topic Modeling
- Experiments

The Big Data Challenge
- Huge amounts of data push the limits of existing methods and systems.
- Real applications bring streaming, potentially infinite data.
- Data are complex, with latent variables and hierarchical structure.

Story I. Online Learning [Shalev-Shwartz 07, Online Learning: Theory, Algorithms and Applications]
Setting: an online learner A receives an incoming data stream (x_0, y_0), (x_1, y_1), (x_2, y_2), ...
At each round t, the learner A:
1. predicts with the current model w_t;
2. incurs loss \ell(w_t; x_t, y_t);
3. learns how to predict better.
[Figure: the loop between the online learner A, its model w_t, the loss \ell(w_t; x_t, y_t), and the data stream {(x_t, y_t)}.]

Online Passive-Aggressive (PA) Algorithms [Crammer et al. 06, Online Passive-Aggressive Algorithms]
Weight update: obtain

  w_{t+1} = \operatorname*{argmin}_w \; \frac{1}{2}\|w - w_t\|^2 + 2c\,\ell_\epsilon(w; x_t, y_t)   (1)

where the first term is the regularizer (stay close to w_t: passive) and the second is the large-margin loss (correct margin violations: aggressive).
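
A minimal runnable sketch of this update for binary classification (not from the paper): the PA-I variant, assuming the hinge loss \ell_\epsilon(w; x, y) = (\epsilon - y\,w^\top x)_+ and a dual step size capped at c.

```python
import numpy as np

def pa_update(w, x, y, c=1.0, eps=1.0):
    """One passive-aggressive round (PA-I style).

    Passive: if the margin y * w.x already exceeds eps, keep w unchanged.
    Aggressive: otherwise move w just enough toward satisfying the margin,
    with the dual step size capped at c.
    """
    loss = max(0.0, eps - y * np.dot(w, x))    # epsilon-insensitive hinge
    if loss == 0.0:
        return w                               # passive round
    tau = min(c, loss / np.dot(x, x))          # closed-form capped step
    return w + tau * y * x                     # aggressive round

# toy stream: learn to predict the sign of the first feature
rng = np.random.default_rng(0)
w = np.zeros(2)
for _ in range(200):
    x = rng.normal(size=2)
    y = 1.0 if x[0] > 0 else -1.0
    w = pa_update(w, x, y)
print(w)   # the first coordinate dominates
```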

Why online learning works: the notion of regret
Cumulative loss against the best fixed model w^* chosen with hindsight of all the data:

  Regret = \sum_{t=0}^{T-1} \ell(w_t; x_t, y_t) - \sum_{t=0}^{T-1} \ell(w^*; x_t, y_t)   (2)

Regret bound. Example: in PA with \ell = \ell_\epsilon^2,

  \sum_{t=0}^{T-1} \ell(w_t) \le cR \sum_{t=0}^{T-1} \ell(w^*) + \text{const}   (3)

so the online learner implicitly minimizes the empirical loss.
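
A synthetic numeric look at eq. (2) (an illustration, not an experiment from the paper): run the PA-I step from the sketch above against a strong fixed comparator w^* and watch the average regret shrink.

```python
import numpy as np

rng = np.random.default_rng(1)

def eps_hinge(w, x, y, eps=1.0):
    return max(0.0, eps - y * np.dot(w, x))

w = np.zeros(2)                        # online learner's model
w_star = np.array([3.0, 0.0])          # fixed model chosen with hindsight
regret, T = 0.0, 2000
for t in range(T):
    x = rng.normal(size=2)
    y = 1.0 if x[0] > 0 else -1.0
    loss = eps_hinge(w, x, y)
    regret += loss - eps_hinge(w_star, x, y)
    if loss > 0:                       # PA-I update, as sketched earlier
        w += min(1.0, loss / np.dot(x, x)) * y * x
print(regret / T)                      # average regret; decays as T grows
```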

Story II. Sequential Bayesian Inference
Maintain a distribution over w. At each round t:
1. Let the current distribution q_t(w) be the prior.
2. For each incoming data point (x_t, y_t), apply Bayes' theorem:

   q_{t+1}(w) \propto q_t(w)\, p(x_t \mid w)   (4)

3. If inference is intractable, use an approximation of q_{t+1}(w). [Broderick et al. 13, Streaming Variational Bayes]
Advantages:
- captures uncertainty;
- flexible enough to model the underlying structure;
- can be nonparametric.
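
A minimal conjugate instance of update (4), assuming a Beta prior on a coin's bias w and Bernoulli observations, so q_{t+1}(w) stays in closed form and the approximation of step 3 is never needed:

```python
# q_0(w) = Beta(1, 1), i.e. uniform over the coin bias w in [0, 1]
a, b = 1.0, 1.0
stream = [1, 0, 1, 1, 1, 0, 1, 1]   # observed x_t in {0, 1}
for x in stream:
    # q_{t+1}(w) ∝ q_t(w) p(x_t | w): Beta(a, b) -> Beta(a + x, b + 1 - x)
    a, b = a + x, b + (1 - x)
print(a / (a + b))                  # posterior mean of w: 0.7
```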

The Gap
Story I. Online Learning (PA) copes with huge, streaming data, but for complex data it
1. keeps only a single point estimate of the model;
2. ignores the latent structure underlying complex data.
Story II. Sequential Bayesian Inference
- lacks discriminative power;
- does not directly minimize loss (or regret).

A New Perspective
Intuition: for each incoming data point (x_t, y_t),
- infer the latent variable z_t from x_t;
- make a prediction ŷ_t;
- the learner incurs an instantaneous loss;
- learn both how to infer and how to predict.

Bayesian Passive-Aggressive Learning (BayesPA)
1. At round t, the learner makes a prediction by giving q_t(w).
2. It incurs an instantaneous loss that combines both
   - the Bayesian log-loss -\mathbb{E}_{q(w)}[\log p(x_t \mid w)], and
   - a decision-theoretic loss \ell(q(w); x_t, y_t).
3. Learning rule:

   q_{t+1}(w) = \operatorname*{argmin}_{q(w) \in \mathcal{F}_t} \; \mathrm{KL}[q(w) \,\|\, q_t(w)] - \mathbb{E}_{q(w)}[\log p(x_t \mid w)] + 2c\,\ell_\epsilon(q(w); x_t, y_t)   (5)

   where the first two terms act as the regularizer and the last is the instantaneous loss.

Interpretation
Because \mathrm{KL}[q(w) \,\|\, q_t(w)] - \mathbb{E}_{q(w)}[\log p(x_t \mid w)] = \mathrm{KL}[q(w) \,\|\, q_t(w) p(x_t \mid w)] + \text{const}, the learning rule can be rewritten as

  q_{t+1}(w) = \operatorname*{argmin}_{q(w) \in \mathcal{F}_t} \; \mathrm{KL}[q(w) \,\|\, q_t(w) p(x_t \mid w)] + 2c\,\ell(q(w); x_t, y_t)   (6)

where q_t(w) p(x_t \mid w) is the unnormalized Bayesian posterior: BayesPA projects the posterior onto \mathcal{F}_t while also paying the margin-based loss.

BayesPA classifiers: two types
1. Averaging classifier
   Prediction rule: ŷ_t = sign \mathbb{E}_{q(w)}[w^\top x_t].
   Averaged hinge loss: \ell_\epsilon^{ave}(q(w); x_t, y_t) = \big(\epsilon - y_t\, \mathbb{E}_{q(w)}[w^\top x_t]\big)_+.
2. Gibbs classifier
   First draw w \sim q_t(w), then predict ŷ_t = sign(w^\top x_t).
   Expected hinge loss: \ell_\epsilon^{gibbs}(q(w); x_t, y_t) = \mathbb{E}_{q(w)}\big[(\epsilon - y_t\, w^\top x_t)_+\big].
Lemma

  \ell_\epsilon^{gibbs}(q(w); x_t, y_t) \ge \ell_\epsilon^{ave}(q(w); x_t, y_t)   (7)

(by Jensen's inequality, since (\cdot)_+ is convex).
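
A quick Monte Carlo check of the lemma (a synthetic illustration, assuming a Gaussian posterior q(w)): the expected hinge dominates the averaged hinge.

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma = np.array([0.3, -0.2]), 0.5     # q(w) = N(mu, sigma^2 I)
x, y, eps = np.array([1.0, 2.0]), 1.0, 1.0

ws = mu + sigma * rng.normal(size=(100_000, 2))    # draws w ~ q(w)
gibbs = np.maximum(0.0, eps - y * ws @ x).mean()   # E_q[(eps - y w.x)_+]
ave = max(0.0, eps - y * np.dot(mu, x))            # (eps - y E_q[w].x)_+
print(gibbs >= ave, round(gibbs, 3), round(ave, 3))  # True, gibbs >= ave
```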

Update Rule
Averaging classifier:

  q_{t+1}(w) \propto \underbrace{q_t(w)}_{\text{prior}}\, \underbrace{p(x_t \mid w)}_{\text{likelihood}}\, \underbrace{\exp(\tau_t\, y_t\, w^\top x_t)}_{\text{pseudo-likelihood}}   (8)

where the dual slack variable \tau_t enforces the max-margin constraint.

Gibbs classifier:

  q_{t+1}(w) \propto q_t(w)\, p(x_t \mid w)\, \exp\big(-2c\,(\epsilon - y_t\, w^\top x_t)_+\big)   (9)
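
To see why update (8) reproduces PA (the theorem on the next slide), here is a sketch of the conjugate Gaussian case, assuming q_t(w) = N(\mu_t, I), the likelihood term p(x_t \mid w) omitted, and \tau_t set as in PA-I: multiplying N(\mu_t, I) by \exp(\tau_t y_t w^\top x_t) and completing the square gives N(\mu_t + \tau_t y_t x_t, I), so only the mean moves, and by exactly the PA step.

```python
import numpy as np

def bayespa_avg_mean_update(mu, x, y, c=1.0, eps=1.0):
    """Mean update implied by eq. (8) for q_t(w) = N(mu, I) with the
    likelihood term omitted; the covariance stays the identity."""
    loss = max(0.0, eps - y * np.dot(mu, x))       # averaged hinge at mu
    tau = min(c, loss / np.dot(x, x)) if loss > 0 else 0.0
    return mu + tau * y * x                        # exactly the PA step
```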

Theory

Connection to PA
Theorem. BayesPA subsumes online PA.
Proof idea: show that the update of the posterior mean, \mu_t \to \mu_{t+1}, follows the PA update.

Regret Analysis
In BayesPA the decision is a distribution q(w); regret compares the cumulative loss against a fixed distribution p^*(w) chosen with hindsight of all the data:

  Regret = \sum_{t=0}^{T-1} \ell(q_t(w); x_t, y_t) - \sum_{t=0}^{T-1} \ell(p^*(w); x_t, y_t)   (10)

where the per-round loss is

  \ell(q(w); x_t, y_t) = -\mathbb{E}_{q(w)}[\log p(x_t \mid w)] + 2c\,\ell_\epsilon(q(w); x_t, y_t)   (11)

A regret bound for BayesPA
Define the tilted distribution

  q_t^{\tau,u}(w) \propto q_t(w)\, \exp\big(u\,U(w, x_t, y_t) + \tau\,T(w, x_t, y_t)\big),

where the sufficient statistics are U(w, x_t, y_t) = \log p(x_t \mid w) and T(w, x_t, y_t) = y_t\, w^\top x_t.

Theorem. Suppose the Fisher information of q_t^{\tau,u} satisfies J_t \succeq \lambda_t I. Then

  Regret \le \mathrm{KL}[p^*(w) \,\|\, p_0(w)] + \frac{1}{2}(1 + 4c^2) \sum_{t=0}^{T-1} \frac{1}{\lambda_t}   (12)

Remark. If the Bayesian CLT holds, BayesPA achieves the optimal O(\log T) regret.

Practice

Some Extensions
Mini-batches (a code sketch follows this slide): with X_t = \{x_d\}_{d \in B_t} and Y_t = \{y_d\}_{d \in B_t}, replace the per-example terms by their mini-batch sums, \log p(X_t \mid w) = \sum_{d \in B_t} \log p(x_d \mid w) and \ell_\epsilon(q(w); X_t, Y_t) = \sum_{d \in B_t} \ell_\epsilon(q(w); x_d, y_d):

  \min_{q \in \mathcal{F}_t} \; \mathrm{KL}[q(w) \,\|\, q_t(w)] - \mathbb{E}_{q(w)}[\log p(X_t \mid w)] + 2c\,\ell_\epsilon(q(w); X_t, Y_t)

Multi-task learning: with multiple labels (x_t, y_t^1, \dots, y_t^\tau),

  \min_{q \in \mathcal{F}_t} \; \mathrm{KL}\big[q(w, M, H_t) \,\|\, q_t(w, M)\, p(X_t \mid w, M, H_t)\big] + 2c \sum_{\tau=1}^{T} \ell_\epsilon\big(q(w, M, H_t); X_t, Y_t^\tau\big)
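
A hypothetical mini-batch step under the same Gaussian assumptions as the earlier sketch: apply one averaged correction per batch, using the mean of the per-example PA steps (a simplification of solving the batch objective exactly).

```python
import numpy as np

def bayespa_minibatch_mean(mu, X, Y, c=1.0, eps=1.0):
    """One update of the posterior mean for a mini-batch B_t."""
    steps = []
    for x, y in zip(X, Y):
        loss = max(0.0, eps - y * np.dot(mu, x))
        tau = min(c, loss / np.dot(x, x)) if loss > 0 else 0.0
        steps.append(tau * y * x)                  # per-example PA step
    return mu + np.mean(steps, axis=0)             # averaged batch step
```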

BayesPA with Latent Space
[Figure: (a) draw a mini-batch (X_t, Y_t) and infer the hidden structure q^*(H_t); (b) update the distribution of the global variables (w, M).]
Formulation: maintain a distribution q(w, M) over the global variables and solve

  \min_{q \in \mathcal{F}_t} \; \mathrm{KL}\big[q(w, M, H_t) \,\|\, q_t(w, M)\, p(X_t \mid w, M, H_t)\big] + 2c\,\ell_\epsilon\big(q(w, M, H_t); X_t, Y_t\big)   (13)

Problem: how to efficiently infer the marginal q(w, M)?

Approximate Inference in BayesPA
Variational inference. Idea: restrict \mathcal{F}_t to a tractable family.
1. Structured mean field: q(w, M, H_t) = q(w)\, q(M)\, q(H_t).
2. Solve the BayesPA problem within this family.
3. Output q_{t+1}(w, M) = q^*(w)\, q^*(M).
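
A self-contained toy of the mean-field idea behind step 1 (not the MedLDA updates): fit a fully factorized q(w)q(h) to a correlated 2-D Gaussian p(w, h) by coordinate ascent, where each factor update conditions on the other factor's current mean.

```python
import numpy as np

Lam = np.array([[2.0, 1.2], [1.2, 2.0]])  # precision matrix of p(w, h)
m = np.array([1.0, -1.0])                 # mean of p(w, h)

Ew, Eh = 0.0, 0.0                         # initial means of q(w), q(h)
for _ in range(50):
    Ew = m[0] - (Lam[0, 1] / Lam[0, 0]) * (Eh - m[1])   # update q(w)
    Eh = m[1] - (Lam[1, 0] / Lam[1, 1]) * (Ew - m[0])   # update q(h)
print(Ew, Eh)  # converges to the true means (1, -1); the factor
               # variances 1/Lam[i, i] under-estimate the true spread
```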

Application to Topic Modeling
Review of Maximum Entropy Discriminant LDA (MedLDA).

Online Max-Margin Topic Models
Online MedLDA. For MedLDA, latent BayesPA concretizes to

  \min_{q, \xi} \; \mathrm{KL}\big[q(w, \Phi, Z_t) \,\|\, q_t(w, \Phi)\, p_0(Z_t)\, p(X_t \mid \Phi, Z_t)\big] + 2c \sum_{d \in B_t} \ell_\epsilon(w; z_d, y_d)   (14)
  s.t. q(w, \Phi, Z_t) \in \mathcal{P}.

In our paper:
- efficient algorithms for online MedLDA with both \ell_\epsilon^{ave} and \ell_\epsilon^{gibbs};
- a nonparametric extension: the max-margin Hierarchical Dirichlet Process (MedHDP);
- BayesPA for nonparametric models: efficient algorithms for online MedHDP.

Experiments

Classification on 20NG
20 Newsgroups:
1. 20 categories of documents;
2. training/testing split: 11269/7505;
3. one-vs-all for the multiclass problem.
Baselines: batch MedLDA (Zhu et al. 2009, 2013), batch MedHDP, sparse stochastic LDA (Mimno et al. 2012), and truncation-free HDP (Wang & Blei 2012).

Results of Online MedLDA
[Figure: error rate vs. number of passes through the dataset (log scale) for pamedlda_ave, pamedlda_gibbs, MedLDA_gibbs, gMedLDA, and spLDA+SVM.]

Results of Online MedLDA
[Figure: accuracy vs. number of topics (left) and training time in seconds, log scale, vs. number of topics (right) for the same methods.]

Sensitivity to Batch Size
[Figure: test error vs. CPU seconds (log scale), two panels, for batch sizes 1, 4, 16, 64, 256, 1024, and full batch.]

Online MedHDP on 20NG
[Figure: accuracy vs. number of topics (left) and training time in seconds, log scale (right), for pamedhdp_ave, pamedhdp_gibbs, bMedHDP, and tfHDP+SVM.]

Multi-Task Learning on the Wikipedia Dataset
- Training data: 1.1M Wikipedia articles, 20 categories.
- Test data: 5000 Wikipedia articles.
- Vocabulary: 917,683 unique items.
[Figure: F1 score vs. training time in seconds (log scale) for pamedldamt_ave, pamedhdpmt_ave, pamedldamt_gibbs, pamedhdpmt_gibbs, MedLDAmt, and MedHDPmt.]

Contributions
- A generic online learning framework.
- Extends Passive-Aggressive learning to the Bayesian setting.
- Generalizes sequential Bayesian inference to discriminative learning.
- Develops efficient online learning algorithms for max-margin topic models.