Online Bayesian Passive-Aggressive Learning

Online Bayesian Passive-Aggressive Learning. Tianlin Shi (stl501@gmail.com), Jun Zhu (dcszj@mail.tsinghua.edu.cn).

The BIG DATA Challenge. Large amounts of data. Big data: Big Science (25 PB of annual data); streaming data; complex data: text, images, genomics, etc. (Image courtesy: http://robotic-rodents.com/)

Online Learning. Batch learning: Data → Learning Algorithm → Model, with loss L. Problems: 1. Data may come in as a stream. 2. We don't have the memory/time to compute over all of it. 3. There is redundancy in the data.

Online Learning. Online learning: data arrives as individual instances; the online learning algorithm suffers an instantaneous loss, makes a prediction (supervised case), and updates the model, with overall loss L.

Online Passive-Aggressive Learning (Crammer et al., 2006 [1]). Online update of a large-margin classifier: the SVM weight w is learned from sequential data {x_t}_{t≥1} and labels {y_t}_{t≥1}, with a closed-form update rule. Drawbacks: 1. Limited model complexity. 2. A single point estimate of the model.

Bayesian Models. Flexibility: can be nonparametric, e.g., an infinite number of components in a topic model (Teh et al., HDP, JASA 2006). Posterior inference is challenging: both VB and MCMC can be expensive on big data. Attempts to speed up inference: online LDA (Hoffman et al., NIPS 2010); online sparse stochastic inference (Mimno et al., ICML 2012); stochastic gradient Fisher scoring (Ahn et al., ICML 2012). These models typically lack discriminative ability.

Max-Margin Bayesian Models. MED: maximum entropy discrimination (Jaakkola et al., 1999). MED with latent variables: MedLDA (Zhu et al., JMLR 2012). MED with nonparametric Bayesian inference: M3F, max-margin matrix factorization (Xu et al., NIPS 2012). Posterior inference remains a big challenge!

Online Bayesian Passive-Aggressive Learning (BayesPA)

Outline: General formulation; Online max-margin topic models; Experiments; Future work.

Outline: General formulation (this part); Online max-margin topic models; Experiments; Future work.

Online PA Algorithms. Update the weight:
w_{t+1} = argmin_w ||w − w_t||²   s.t. ℓ_ε(w; x_t, y_t) = 0.
Case I (passive update): ℓ_ε = 0, so w_{t+1} = w_t. Case II (aggressive update): ℓ_ε > 0, so w_t is projected onto the feasible zone.

Online BayesPA Learning. Update the distribution over the weight:
q_{t+1}(w) = argmin_{q ∈ F_t} KL[q(w) || q_t(w)] − E_q[log p(x_t | w)]   s.t. ℓ_ε(q(w); x_t, y_t) = 0.
Case I (passive update): ℓ_ε = 0, so q_t(w) is updated within the feasible zone. Case II (aggressive update): ℓ_ε > 0, so q_{t+1}(w) is moved into the feasible zone.

Online PA Algorithms, soft-margin constraints:
w_{t+1} = argmin_w ||w − w_t||² + 2c ℓ_ε(w; x_t, y_t).
Closed-form update rule: w_{t+1} = w_t + y_t τ_t x_t, where τ_t = min(c, ℓ_ε / ||x_t||²).

Online BayesPA Learning, soft-margin constraints:
q_{t+1}(w) = argmin_{q ∈ F_t} KL[q(w) || q_t(w) p(x_t | w)] + 2c ℓ_ε(q(w); x_t, y_t).

The loss ℓ_ε is the hinge loss for classification, (ε − y_t wᵀx_t)_+, or the ε-insensitive loss for regression, (|wᵀx_t − y_t| − ε)_+, where (x)_+ = max(0, x). We focus on classifiers for now: averaging classifiers predict with the posterior expectation, while Gibbs classifiers draw a sample w ~ q(w) and predict with it.
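
To make the closed-form PA update concrete, here is a minimal sketch of a PA-I style classifier step with the hinge loss (the function and variable names are mine, not from the slides):

```python
import numpy as np

def pa_update(w, x, y, C=1.0, eps=1.0):
    """One online Passive-Aggressive (PA-I) step with hinge loss.

    w: current weight vector, x: feature vector, y: label in {-1, +1}.
    Returns w_{t+1} = w_t + y * tau * x with tau = min(C, loss / ||x||^2).
    """
    loss = max(0.0, eps - y * w.dot(x))      # hinge loss l_eps(w; x, y)
    if loss == 0.0:                          # passive case: constraint satisfied
        return w
    tau = min(C, loss / x.dot(x))            # aggressive case: clipped step size
    return w + tau * y * x

# toy usage on a simple linear concept
rng = np.random.default_rng(0)
w = np.zeros(5)
for _ in range(100):
    x = rng.normal(size=5)
    y = 1.0 if x[0] + 0.5 * x[1] > 0 else -1.0
    w = pa_update(w, x, y)
```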

Lemma 1. The expected hinge loss ℓ_ε^Gibbs is an upper bound of the hinge loss ℓ_ε^Bayes, i.e., ℓ_ε^Gibbs ≥ ℓ_ε^Bayes. Proof: straightforward from the convexity of (x)_+ (Jensen's inequality).
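
A quick Monte Carlo illustration of Lemma 1 (my own check, not from the slides): for a Gaussian q(w), the Gibbs loss E_q[(ε − y wᵀx)_+] should never fall below the averaging loss (ε − y E_q[w]ᵀx)_+.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 10
mu, sigma = rng.normal(size=d), 0.5            # q(w) = N(mu, sigma^2 I)
x, y, eps = rng.normal(size=d), 1.0, 1.0

w_samples = mu + sigma * rng.normal(size=(100_000, d))
margins = y * w_samples.dot(x)

gibbs_loss = np.maximum(0.0, eps - margins).mean()   # E_q[(eps - y w.x)_+]
avg_loss = max(0.0, eps - y * mu.dot(x))             # (eps - y E_q[w].x)_+

print(gibbs_loss, avg_loss)
assert gibbs_loss >= avg_loss - 1e-6   # Jensen: convexity of (.)_+
```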

Lemma 2. If q_0(w) = N(0, I), F_t = P, and we use the averaging classifier, then non-likelihood BayesPA subsumes online PA. Non-likelihood BayesPA drops the likelihood term:
q_{t+1}(w) = argmin_{q ∈ F_t} KL[q(w) || q_t(w)]   s.t. ℓ_ε(q(w); x_t, y_t) = 0.

Proof sketch. The soft-margin form of the non-likelihood update is
min_{q(w) ∈ P} KL[q(w) || q_t(w)] + 2c max(0, ε − y_t E_q[wᵀx_t]).
Conjugacy (Zhu et al., RegBayes, 2012): for a feature function ψ and a convex function g,
min_{q(M) ∈ P} KL[q(M) || p(M, D)] + g(E_q[ψ(M)]) = max_φ −log ∫ p(M, D) exp(⟨φ, ψ(M)⟩) dM − g*(−φ),
where the optimal solution is q*(M) ∝ p(M, D) exp(⟨φ*, ψ(M)⟩).

Applying the conjugacy result,
min_{q(w) ∈ P} KL[q(w) || q_t(w)] + 2c max(0, ε − y_t E_q[wᵀx_t]) = max_{0 ≤ τ ≤ 2c} −log Γ(τ),
where q*(w) = (1/Γ(τ)) q_t(w) exp(τ(y_t wᵀx_t − ε)) and Γ(τ) is the normalization constant.
Use induction: assume q_t(w) = N(w; μ_t, σ²I), with initial case q_0(w) = N(w; 0, σ²I). Then
q_{t+1}(w) ∝ exp(−||w − μ_t||²/(2σ²) + τ(y_t wᵀx_t − ε)).
Dual form: min_{0 ≤ τ ≤ 2c}  −τ y_t μ_tᵀx_t + (σ²τ²/2) x_tᵀx_t + ετ.
Primal form: min_μ  ||μ − μ_t||²/(2σ²) + 2c max(0, ε − y_t μᵀx_t),
which matches the online PA update applied to the posterior mean.
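
As a sanity check on the Gaussian case, here is a minimal sketch (my parametrization) of the resulting mean update, obtained from the clipped dual solution:

```python
import numpy as np

def bayespa_gaussian_mean_update(mu, x, y, c=1.0, eps=1.0, sigma2=1.0):
    """One BayesPA step on the mean of q(w) = N(mu, sigma2 * I) (sketch).

    Solves  min_mu'  ||mu' - mu||^2 / (2*sigma2) + 2*c * max(0, eps - y * mu'.x)
    via its dual:  tau = clip(loss / (sigma2 * ||x||^2), 0, 2*c).
    """
    loss = max(0.0, eps - y * mu.dot(x))
    tau = min(2.0 * c, loss / (sigma2 * x.dot(x)))
    return mu + sigma2 * tau * y * x

# With sigma2 = 1 this coincides with a PA-I step whose aggressiveness parameter is C = 2*c.
```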

Lemma 3. If F_t = P and we use the Gibbs classifier, the update rule of BayesPA is
q_{t+1}(w) ∝ q_t(w) · p(x_t | w) · e^{−2c (ε − y_t wᵀx_t)_+},
i.e., prior × likelihood × pseudo-likelihood.

Extension: Learning with Mini-Batches. At time t, we receive an incoming batch B_t with X_t = {x_d}_{d ∈ B_t} and Y_t = {y_d}_{d ∈ B_t}, and solve
min_{q ∈ F_t} KL[q(w) || q_t(w) p(X_t | w)] + 2c ℓ_ε(q(w); X_t, Y_t),
where ℓ_ε(q(w); X_t, Y_t) = Σ_{d ∈ B_t} ℓ_ε(q(w); x_d, y_d).

Extension: Learning with Latent Structures. (Diagram: data x_1, ..., x_5 with latent variables h_1, ..., h_5 forming the latent structure H, generated by a model M; classifier weight w.)

Extension: Learning with Latent Structures. Uncertainty in H_t → infer H_t together with M and w via the BayesPA rule:
min_{q ∈ F_t} KL[q(w, M, H_t) || q_t(w, M) p(x_t | w, M, H_t)] + 2c ℓ_ε(q(w, M, H_t); x_t, y_t).
But how can we obtain q_{t+1}(w, M)? Marginalizing q(w, M, H_t) over H_t is intractable. Mean-field assumption: q(w, M, H_t) = q(w) q(M) q(H_t). Solve the objective and set q_{t+1}(w, M) = q*(w) q*(M).

Outline: General formulation; Online max-margin topic models (this part); Experiments; Future work.

Batch MedLDA: graphical interpretation. (Plate diagram: topics φ_k with prior β, k = 1, ..., K; document proportions θ_d with prior α; topic assignments z_di and words x_di, i = 1, ..., n_d; labels y_d generated from w (prior variance v²) and z_d; d = 1, ..., D.)

For each topic k = 1, 2, ..., K: draw φ_k ~ Dir(β) and w_k ~ N(w_k; 0, v²).
For each document d = 1, 2, ..., D: draw θ_d ~ Dir(α).
For the i-th word in document d: draw z_di ~ Multi(θ_d) and x_di ~ Multi(φ_{z_di}).
Predict with f(w, z_d) = wᵀz̄_d, where z̄_dk = (1/n_d) Σ_i I[z_di = k].
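
A minimal sketch of this generative process (dimensions and hyperparameters are illustrative choices of mine):

```python
import numpy as np

rng = np.random.default_rng(2)
K, V, D, n_d = 5, 50, 20, 30            # topics, vocabulary size, documents, words per doc
alpha, beta, v2 = 0.5, 0.1, 1.0

phi = rng.dirichlet(np.full(V, beta), size=K)        # phi_k ~ Dir(beta)
w = rng.normal(0.0, np.sqrt(v2), size=K)             # w_k ~ N(0, v^2)

docs, labels = [], []
for d in range(D):
    theta = rng.dirichlet(np.full(K, alpha))         # theta_d ~ Dir(alpha)
    z = rng.choice(K, size=n_d, p=theta)             # z_di ~ Multi(theta_d)
    x = np.array([rng.choice(V, p=phi[k]) for k in z])   # x_di ~ Multi(phi_{z_di})
    z_bar = np.bincount(z, minlength=K) / n_d        # empirical topic proportions z_bar_d
    f = w.dot(z_bar)                                 # prediction score f(w, z_d)
    docs.append(x)
    labels.append(1.0 if f > 0 else -1.0)            # toy label taken from the sign of the score
```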

Batch MedLDA: inference in LDA. Let Φ = {φ_k}_{k=1}^K, Θ = {θ_d}_{d=1}^D, Z = {z_d}_{d=1}^D, and X = {x_d}_{d=1}^D. LDA infers the posterior
p(Φ, Θ, Z | X) ∝ p_0(Φ, Θ, Z) p(X | Z, Φ),
or equivalently solves
min_{q ∈ P} KL[q(Φ, Θ, Z) || p(Φ, Θ, Z | X)].

Batch MedLDA: inference in MedLDA. Inference problem:
min_{q ∈ P} KL[q(w, Φ, Θ, Z) || p(Φ, Θ, Z | X)] + 2c Σ_{d=1}^D ℓ_ε(q(w, z_d); x_d, y_d).
Prediction model: f(w, z_d) = wᵀz̄_d, where z̄_dk = (1/n_d) Σ_i I[z_di = k].
Averaging loss: ℓ_ε^Avg(q(w, z_d); x_d, y_d) = (ε − y_d E_q[f(w, z_d)])_+.
Gibbs loss: ℓ_ε^Gibbs(q(w, z_d); x_d, y_d) = E_q[(ε − y_d f(w, z_d))_+].

Online MedLDA. Recall BayesPA with latent structures:
min_{q ∈ F_t} KL[q(w, M, H_t) || q_t(w, M) p(X_t | w, M, H_t)] + 2c ℓ_ε(q(w, M, H_t); X_t, Y_t).
In MedLDA we have M = Φ and H_t = (Θ_t, Z_t). To reduce the parameter space, we collapse out Θ_t:
p(Z_d | α) = ∫_{θ_d} p(Z_d | θ_d) p(θ_d | α) dθ_d = D(α + C_d) / D(α)   for d ∈ B_t,
where C_d is the vector of topic counts in document d and D(·) denotes the Dirichlet normalization constant. So M = Φ and H_t = Z_t. Exact inference is still hard → mean-field assumption: q(w, Φ, Z_t) = q(w) q(Φ) q(Z_t).
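
The collapsed term D(α + C_d)/D(α) is the standard Dirichlet-multinomial normalizer ratio; below is a small sketch of computing its log with the log-gamma function (helper names are mine):

```python
import numpy as np
from scipy.special import gammaln

def log_dirichlet_norm(a):
    """log D(a) = sum_k log Gamma(a_k) - log Gamma(sum_k a_k)."""
    return gammaln(a).sum() - gammaln(a.sum())

def collapsed_log_p_z(topic_counts, alpha):
    """log p(Z_d | alpha) = log D(alpha + C_d) - log D(alpha),
    where topic_counts is the vector C_d of per-topic word counts in document d."""
    a0 = np.full_like(topic_counts, alpha, dtype=float)
    return log_dirichlet_norm(a0 + topic_counts) - log_dirichlet_norm(a0)

# e.g. a 4-topic document with counts C_d = [3, 0, 5, 2] and alpha = 0.5
print(collapsed_log_p_z(np.array([3.0, 0.0, 5.0, 2.0]), 0.5))
```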

Online MedLDA with Gibbs classifiers. By Lemma 3, the optimal solution has the form
q_{t+1}(w, Φ, Z_t) ∝ q_t(w, Φ) p_0(Z_t | α) p(X_t | Z_t, Φ) ψ(Y_t | w, Z_t),
where ψ(Y_t | w, Z_t) = Π_{d ∈ B_t} ψ(y_d | w, z_d) and ψ(y_d | w, z_d) = e^{−2c (ε − y_d wᵀz̄_d)_+}.
This form does not look friendly to work with!

Lemma: scale mixture (Zhu et al., 2013). The pseudo-likelihood can be expressed as
ψ(y_d | w, z_d) = ∫_0^∞ (1/√(2πλ_d)) exp(−(λ_d + c ξ_d)² / (2λ_d)) dλ_d,
where ξ_d = ε − y_d wᵀz̄_d. Let ψ(Y_t, λ_t | w, Z_t) = Π_{d ∈ B_t} ψ(y_d, λ_d | w, z_d) with
ψ(y_d, λ_d | w, z_d) = (1/√(2πλ_d)) exp(−(λ_d + c ξ_d)² / (2λ_d)).
So the posterior at round t can be expressed as
q_{t+1}(w, Φ, Z_t, λ_t) ∝ q_t(w, Φ) p_0(Z_t | α) p(X_t | Z_t, Φ) ψ(Y_t, λ_t | w, Z_t).
Again, the mean-field assumption: q(w, Φ, Z_t, λ_t) = q(w) q(Φ) q(Z_t, λ_t).
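
The scale-mixture identity can be verified numerically; here is a small sketch of such a check using quadrature (my own, with ξ_d fixed to a few test values):

```python
import numpy as np
from scipy.integrate import quad

def augmented_density(lam, c, xi):
    """Integrand of the scale mixture: (2*pi*lam)^(-1/2) * exp(-(lam + c*xi)^2 / (2*lam))."""
    return np.exp(-(lam + c * xi) ** 2 / (2.0 * lam)) / np.sqrt(2.0 * np.pi * lam)

for c, xi in [(1.0, 0.7), (2.0, -0.3), (0.5, 1.5)]:
    integral, _ = quad(augmented_density, 0.0, np.inf, args=(c, xi))
    target = np.exp(-2.0 * c * max(0.0, xi))    # pseudo-likelihood exp(-2c (xi)_+)
    print(c, xi, integral, target)              # the last two columns should agree
```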

Online Gibbs MedLDA: global update. Fix q(Z_t, λ_t); with the mean-field assumption, the optimal q(w) and q(Φ) follow directly. If initially q(w) is Gaussian and q(Φ) is a product of Dirichlets, then by induction they remain in these families, and we obtain closed-form update rules for their parameters (update equations given on the slide).

Online Gibbs MedLDA: local update. Fix q(w, Φ); then
q(Z_t, λ_t) ∝ p_0(Z_t) Π_{d ∈ B_t} [ (Π_{i ∈ [n_d]} Λ_{z_di, x_di}) (1/√(2πλ_d)) exp(−E_{q(Φ, w)}[(λ_d + c ξ_d)²] / (2λ_d)) ],
where Λ_{z_di, x_di} = exp(E_{q(Φ)}[log φ_{z_di, x_di}]) = exp(Ψ(Δ*_{z_di, x_di}) − Ψ(Σ_x Δ*_{z_di, x})),
with Δ* the parameters of the Dirichlet q(Φ) and Ψ the digamma function. It is hard to evaluate this distribution directly: there is no closed form, and Z_t ranges over a combinatorially large number of configurations → Gibbs sampling!
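
A small sketch of computing the Λ term from the Dirichlet parameters Δ* (my array layout: one row per topic, one column per word type):

```python
import numpy as np
from scipy.special import digamma

def expected_log_phi(delta):
    """E_q[log phi_{k, x}] = Psi(delta[k, x]) - Psi(sum_x delta[k, x])
    for q(phi_k) = Dir(delta[k, :]); Psi is the digamma function."""
    return digamma(delta) - digamma(delta.sum(axis=1, keepdims=True))

# Lambda_{k, x} = exp(E_q[log phi_{k, x}]), the word term used when resampling Z_t
delta = np.random.default_rng(3).gamma(1.0, 1.0, size=(5, 50)) + 0.1
Lam = np.exp(expected_log_phi(delta))
```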

Online Gibbs MedLDA: Gibbs sampling. For Z_t: sample each z_di from its local conditional (given on the slide). For λ_t: λ_d^{-1} follows an inverse Gaussian distribution,
λ_d^{-1} ~ IG(λ_d^{-1}; 1 / (c √(ξ_d² + z̄_dᵀ Σ* z̄_d))),
where ξ_d is evaluated at the mean μ* of q(w) and Σ* is its covariance.
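
Sampling λ_d is convenient because λ_d^{-1} is inverse Gaussian; a sketch using NumPy's Wald (inverse Gaussian) sampler, with my variable names:

```python
import numpy as np

def sample_lambda(xi_bar, zSz, c=1.0, rng=np.random.default_rng(4)):
    """Draw lambda_d given the current q(w) and z_bar_d (sketch).

    xi_bar = eps - y_d * mu_star.dot(z_bar_d)      (margin slack at the posterior mean)
    zSz    = z_bar_d.dot(Sigma_star).dot(z_bar_d)  (posterior variance of the score)
    lambda_d^{-1} ~ InverseGaussian(mean = 1 / (c * sqrt(xi_bar^2 + zSz)), shape = 1).
    """
    mean = 1.0 / (c * np.sqrt(xi_bar ** 2 + zSz))
    inv_lambda = rng.wald(mean, 1.0)               # numpy's wald(mean, scale) sampler
    return 1.0 / inv_lambda
```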

Nonparametric Extension: MED hierarchical Dirichlet process (MedHDP). (Plate diagram: stick-breaking weights π_k, k = 1, 2, ...; topics φ_k with prior β, k = 1, 2, ...; document proportions θ_d; topic assignments z_di and words x_di, i = 1, ..., n_d; labels y_d from classifier weights w drawn from a Gaussian process; d = 1, ..., D.)

Nonparametric Extension: stick-breaking process. Stick-breaking weights π_1, π_2, π_3, ... Generate the topic proportions: θ_d ~ Dir(απ).
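
A truncated sketch of the stick-breaking construction and the draw θ_d ~ Dir(απ) (the truncation level and the Beta(1, γ) stick prior are illustrative assumptions of mine):

```python
import numpy as np

rng = np.random.default_rng(5)
K_trunc, gamma, alpha = 50, 2.0, 5.0            # truncation level, concentration parameters

v = rng.beta(1.0, gamma, size=K_trunc)          # stick fractions v_k ~ Beta(1, gamma)
pi = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))   # pi_k = v_k * prod_{j<k} (1 - v_j)
pi /= pi.sum()                                  # renormalize the truncated sticks

theta_d = rng.dirichlet(alpha * pi + 1e-8)      # theta_d ~ Dir(alpha * pi), jitter for stability
```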

MedHDP. Draw the topic proportions from π; the rest is the same as in LDA. Inference:
min_{q ∈ P} KL[q(w, π, Φ, Θ, Z) || p(w, π, Φ, Θ, Z | X)] + 2c Σ_{d=1}^D ℓ_ε(q(w, z_d); x_d, y_d).
The loss function is almost the same as in MedLDA, except for the prediction rule:
f(w, z_d) = Σ_{k=1}^∞ w_k z̄_dk.
The sum is essentially finite, since only finitely many topics appear in the data.

Online Nonparametric MedLDA. Recall BayesPA with latent structures:
min_{q ∈ F_t} KL[q(w, M, H_t) || q_t(w, M) p(X_t | w, M, H_t)] + 2c ℓ_ε(q(w, M, H_t); X_t, Y_t).
In MedHDP we collapse out Θ and introduce auxiliary latent variables s_d (table counts), so that
p(z_d | π) = Σ_{s_d} p(s_d, z_d | π),   with   p(s_d, z_d | π) ∝ Π_k S(n_d z̄_dk, s_dk) (απ_k)^{s_dk},
where S(·, ·) denotes the (unsigned) Stirling numbers of the first kind. Exact inference is hard → mean-field assumption.

Online Nonparametric MedLDA: global update. For Φ and w: the same update rules as in online MedLDA. For π: by the mean-field assumption, if initially q(π_k) = Beta(u_k^0, v_k^0), then by induction q(π_k) remains a Beta distribution, with update rule
u_k* = u_k^t + Σ_{d ∈ B_t} E_q[s_dk],   v_k* = v_k^t + Σ_{d ∈ B_t} Σ_{j > k} E_q[s_dj].
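
A small sketch of this global stick update for a mini-batch of expected table counts (my array layout: one row per document in B_t, one column per topic, truncated):

```python
import numpy as np

def update_sticks(u, v, expected_s):
    """Beta update for q(pi_k) = Beta(u_k, v_k).

    expected_s[d, k] = E_q[s_dk] for documents d in the mini-batch B_t.
    u_k <- u_k + sum_d E_q[s_dk]
    v_k <- v_k + sum_d sum_{j > k} E_q[s_dj]
    """
    s_sum = expected_s.sum(axis=0)                    # sum over documents, per topic
    tail = s_sum[::-1].cumsum()[::-1]                 # tail[k] = sum_{j >= k} s_sum[j]
    u_new = u + s_sum
    v_new = v + np.concatenate((tail[1:], [0.0]))     # sum over j > k
    return u_new, v_new
```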

Online Nonparametric MedLDA: local update. Fix the global distributions; then
q(Z_t, S_t) ∝ exp(E_{q(Φ) q(π)}[log p(X_t | Φ, Z_t) + log p(Z_t, S_t | π)]),
q(λ_t) ∝ exp(E_q[log ψ(Y_t, λ_t | w, Z_t)]).
But π has an infinite number of components! Solution: borrow ideas from Wang & Blei (NIPS 2012), approximate π and sample Z_t, S_t, and λ_t together, using the direct sampling scheme for the HDP (Teh et al., JASA 2006).

Online Nonparametric MedLDA: Gibbs sampling. For Z_t and λ_t: the same as in online MedLDA. For π_k: sample from Beta(a_k, b_k) with
a_k = u_k* + Σ_{d ∈ B_t} s_dk,   b_k = v_k* + Σ_{d ∈ B_t} Σ_{j > k} s_dj.

Outline: General formulation; Online max-margin topic models; Experiments (this part); Future work.

Classification on 20NG. 20Newsgroups: 20 categories of documents, training/testing split 11269/7505. We test online MedLDA (paMedLDA) and online MedHDP (paMedHDP), and compare with: their batch counterparts; Gibbs MedLDA (Zhu et al., ICML 2013); topic model + SVM, using sparse stochastic LDA (Mimno et al., ICML 2012) and truncation-free HDP (Wang & Blei, NIPS 2012).

Sensitivity to Batch Size

Sensitivity to Iterations and Samples

Multi-Task Classification. Extend our algorithm to multi-task learning: each document x_d has labels y_d^1, y_d^2, ..., y_d^T, one per task. We simply solve the joint objective (given on the slide).

Multi-Task Classification. 1.1M-document Wikipedia dataset; 20 kinds of labels, not necessarily exclusive. Training/testing split: 1.1M / 5K. Evaluation: F1 score = 2 · precision · recall / (precision + recall).

Future Work. Theoretical analysis of BayesPA. Parallel, asynchronous BayesPA learning. BayesPA learning for regression problems.

References. [1] Crammer et al. Online Passive-Aggressive Algorithms. JMLR, 2006.

Thank you."