Online Bayesian Passive-Aggressive Learning

Online Bayesian Passive-Aggressive Learning
International Conference on Machine Learning, 2014
Tianlin Shi and Jun Zhu, Tsinghua University, China
Presented by Kyle Ulrich, 21 August 2015

Introduction

Online Passive-Aggressive (PA) learning (Crammer et al., 2006) provides online large-margin discriminative methods. However, PA produces only a point estimate, which rules out rich Bayesian structures (Hjort, 2010; Teh et al., 2006; Ghahramani & Griffiths, 2005). Maximum entropy discrimination (Jaakkola et al., 1999) conjoins max-margin learning and Bayesian generative models. Bayesian Passive-Aggressive (BayesPA) learning is proposed as an online learning method for Bayesian max-margin models.

Online Passive-Aggressive Learning

At time $t$, observe data $x_t$ and response $y_t \in \{-1, +1\}$, and consider a binary classifier parameterized by $w$. The hinge loss is

  $\ell_\epsilon(w; x_t, y_t) = (\epsilon - y_t w^\top x_t)_+$

The PA update rule (Crammer et al., 2006)

  $w_{t+1} = \arg\min_w \tfrac{1}{2}\|w - w_t\|^2 \quad \text{s.t. } \ell_\epsilon(w; x_t, y_t) = 0$  (1)

has two behaviors:
1. Passively assign $w_{t+1} = w_t$ if $\ell_\epsilon(w_t; x_t, y_t) = 0$, or
2. Aggressively project $w_t$ into the feasible zone of zero loss.

Online Passive-Aggressive Learning

The alternative, unconstrained (Lagrangian) form is

  $w_{t+1} = \arg\min_w \tfrac{1}{2}\|w - w_t\|^2 + 2c\,\ell_\epsilon(w; x_t, y_t)$  (2)

Simple update rules may be derived:

  $\ell_\epsilon^t = (\epsilon - y_t w_t^\top x_t)_+, \qquad c_t = \min\!\left(2c,\ \frac{\ell_\epsilon^t}{\|x_t\|^2}\right), \qquad w_{t+1} = w_t + c_t\, y_t x_t$

Allowing for soft-margin constraints provides flexibility for inseparable training examples.
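To make the update concrete, here is a minimal NumPy sketch of one PA step for a binary linear classifier, following the soft-margin form in (2); the names (`pa_update`, `hinge_loss`, the toy data stream) are illustrative and not from the paper.

```python
import numpy as np

def hinge_loss(w, x, y, eps=1.0):
    """Epsilon-insensitive hinge loss: (eps - y * w.x)_+"""
    return max(0.0, eps - y * np.dot(w, x))

def pa_update(w, x, y, c=1.0, eps=1.0):
    """One online Passive-Aggressive step, soft-margin form of eq. (2)."""
    loss = hinge_loss(w, x, y, eps)
    if loss == 0.0:
        return w                                   # passive: keep w_t
    step = min(2.0 * c, loss / np.dot(x, x))       # aggressive: smallest step to zero loss, capped
    return w + step * y * x

# Toy usage: one pass over a stream of (x_t, y_t) pairs.
rng = np.random.default_rng(0)
w = np.zeros(2)
for _ in range(100):
    x = rng.normal(size=2)
    y = 1 if x[0] + x[1] > 0 else -1
    w = pa_update(w, x, y)
print(w)
```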

Maximum Entropy Discrimination

MED is designed for discriminative tasks such as classification or anomaly detection (Jaakkola et al., 1999). Here $w$ is a random variable with prior $p(w)$, and a distribution $q(w)$ is found from

  $\min_{q(w)} \mathrm{KL}[q(w)\,\|\,p(w)] \quad \text{s.t. } \mathbb{E}_{q(w)}[\ell_\epsilon(w; x_t, y_t)] \le \zeta, \ \forall t$  (3)

Support Vector Machines belong to this class of models.

Relationship to Bayesian Inference

MED resembles Bayesian inference cast as an optimization problem. Bayes' rule provides the solution to

  $\min_{q(w)} \mathrm{KL}[q(w)\,\|\,p(w)] - \mathbb{E}_{q(w)}[\log p(x \mid w)]$  (4)

and mean-field variational inference adds the constraint $q(w) = \prod_i q(w_i)$. Therefore, MED introduces a pseudo-likelihood as a discriminative model in place of the likelihood in the Bayesian generative model. These pseudo-likelihood functions may not have a probabilistic interpretation.
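As a sanity check of (4), the sketch below discretizes a one-dimensional $w$, forms the exact posterior $q(w) \propto p(w)\,p(x \mid w)$, and verifies that it attains a lower objective value than a perturbed distribution. The grid, prior, and likelihood are arbitrary choices made only for illustration.

```python
import numpy as np

def objective(q, prior, loglik):
    """KL[q || prior] - E_q[log p(x|w)] over a discrete grid of w values."""
    return np.sum(q * np.log(q / prior)) - np.sum(q * loglik)

w_grid = np.linspace(-3, 3, 201)
prior = np.exp(-0.5 * w_grid**2)
prior /= prior.sum()                       # discrete prior p(w)
loglik = -0.5 * (1.5 - w_grid)**2          # log p(x|w) for one observation x = 1.5

post = prior * np.exp(loglik)
post /= post.sum()                         # Bayes posterior p(w|x)

perturbed = post * np.exp(0.1 * w_grid)    # any other distribution over the grid
perturbed /= perturbed.sum()

print(objective(post, prior, loglik) <= objective(perturbed, prior, loglik))  # True
```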

Online Bayesian Passive-Aggressive Learning

Alternatively, consider sequentially updating a new posterior distribution $q_{t+1}(w)$:

  $\min_{q(w) \in \mathcal{F}_t} \mathrm{KL}[q(w)\,\|\,q_t(w)] - \mathbb{E}_{q(w)}[\log p(x_t \mid w)] \quad \text{s.t. } \ell_\epsilon(q(w); x_t, y_t) = 0$  (5)

This update has two behaviors:
1. Passively pass the posterior, $q_{t+1}(w) \propto q_t(w)\,p(x_t \mid w)$, under no loss, $\ell_\epsilon(q(w); x_t, y_t) = 0$
2. Aggressively project the posterior to a feasible zone of zero loss

When no likelihood is defined (i.e., $p(x_t \mid w)$ does not depend on $w$), the passive step reduces to $q_{t+1}(w) = q_t(w)$.

Online Bayesian Passive-Aggressive Learning

The Lagrangian of the optimization problem is

  $q_{t+1}(w) = \arg\min_{q(w) \in \mathcal{F}_t} \mathcal{L}(q(w)) + 2c\,\ell_\epsilon(q(w); x_t, y_t)$  (6)

For max-margin classifiers, consider two loss functions:

  Averaging classifier: $\ell_\epsilon^{\mathrm{Avg}}(q(w); x_t, y_t) = (\epsilon - y_t\,\mathbb{E}_{q(w)}[w^\top x_t])_+$  (7)

  Gibbs classifier: $\ell_\epsilon^{\mathrm{Gibbs}}(q(w); x_t, y_t) = \mathbb{E}_{q(w)}[(\epsilon - y_t w^\top x_t)_+]$  (8)

The expected hinge loss is an upper bound on the average loss: $\ell_\epsilon^{\mathrm{Gibbs}} \ge \ell_\epsilon^{\mathrm{Avg}}$.
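The gap between the two losses is easy to see numerically. The sketch below assumes a Gaussian $q(w)$ (an arbitrary choice for illustration) and estimates the Gibbs loss by Monte Carlo; by Jensen's inequality it dominates the averaging-classifier loss, which needs only the posterior mean.

```python
import numpy as np

rng = np.random.default_rng(1)
eps, x, y = 1.0, np.array([0.5, -1.0]), 1

# Assume q(w) = N(mu, diag(sigma**2)); these numbers are arbitrary.
mu, sigma = np.array([0.2, 0.3]), np.array([1.0, 0.5])
W = rng.normal(mu, sigma, size=(100_000, 2))          # samples from q(w)

margins = y * (W @ x)
avg_loss = max(0.0, eps - y * (mu @ x))               # hinge at the posterior mean, eq. (7)
gibbs_loss = np.mean(np.maximum(0.0, eps - margins))  # expected hinge, eq. (8)

print(avg_loss, gibbs_loss, gibbs_loss >= avg_loss)   # last value is True
```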

Update Rule

From Bayes' rule with the Gibbs classifier, the solution is

  $q_{t+1}(w) \;\propto\; \underbrace{q_t(w)}_{\text{prior}} \;\underbrace{p(x_t \mid w)}_{\text{likelihood}} \;\underbrace{\exp\!\big(-2c\,(\epsilon - y_t w^\top x_t)_+\big)}_{\text{pseudo-likelihood}}$  (9)
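A toy numeric sketch of (9) on a discretized one-dimensional weight: each round multiplies the current posterior by the max-margin pseudo-likelihood factor. The likelihood term $p(x_t \mid w)$ is omitted here (treated as constant in $w$), and all values are hypothetical.

```python
import numpy as np

w = np.linspace(-4, 4, 401)                    # discretized 1-D weight
q = np.exp(-0.5 * w**2); q /= q.sum()          # q_0(w): standard normal prior

def bayespa_step(q, x, y, c=1.0, eps=1.0):
    """q_{t+1}(w) proportional to q_t(w) * exp(-2c * (eps - y*w*x)_+); likelihood omitted."""
    pseudo = np.exp(-2.0 * c * np.maximum(0.0, eps - y * w * x))
    q_new = q * pseudo
    return q_new / q_new.sum()

for x, y in [(1.2, 1), (0.8, 1), (-1.0, -1)]:  # a toy data stream
    q = bayespa_step(q, x, y)

print(w[np.argmax(q)])                         # the posterior mode drifts toward positive w
```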

Mini-Batches

Observe a mini-batch of data points at time $t$: data $X_t = \{x_d\}_{d \in B_t}$ and labels $Y_t = \{y_d\}_{d \in B_t}$. The per-example losses are simply summed:

  $q_{t+1}(w) = \arg\min_{q \in \mathcal{F}_t} \mathcal{L}(q(w)) + 2c\,\ell_\epsilon(q(w); X_t, Y_t), \qquad \ell_\epsilon(q(w); X_t, Y_t) = \sum_{d \in B_t} \ell_\epsilon(q(w); x_d, y_d)$  (10)

Latent Structures

Bayesian models are extensively developed with global-local structure: global variables $M$ share properties across the dataset, while local variables $H_t = \{h_d\}_{d \in B_t}$ characterize each $X_t$. In general, the posterior $q_{t+1}(w, M, H_t)$ is the solution to

  $\min_{q \in \mathcal{F}_t} \mathcal{L}(q(w, M, H_t)) + 2c\,\ell_\epsilon(q(w, M, H_t); X_t, Y_t)$  (11)

where

  $\mathcal{L}(q) = \mathrm{KL}[q\,\|\,q_t(w, M)\,p_0(H_t)] - \mathbb{E}_q[\log p(X_t \mid w, M, H_t)]$

To decouple global and local structure in the posterior, consider the mean-field assumptions

  $q(w, M, H_t) = q(w)\,q(M)\,q(H_t), \qquad q_{t+1}(w, M) = q^*(w)\,q^*(M)$

Latent Dirichlet Allocation Model

Consider the LDA model (Blei et al., 2003):

  $\theta_d \sim \mathrm{Dir}(\alpha), \quad \phi_k \sim \mathrm{Dir}(\gamma), \quad z_{di} \sim \mathrm{Mult}(\theta_d), \quad x_{di} \sim \mathrm{Mult}(\phi_{z_{di}})$

where $x_{di}$ is the $i$-th word in document $d$, $z_{di}$ is the topic assigned to this word, $\theta_d$ is a distribution over topics for document $d$, and $\phi_k$ is a distribution over the $W$-word vocabulary for topic $k$. Topic assignments should be predictive of class labels $y_d$. The online BayesPA objective for MedLDA (Zhu et al., 2012) is

  $\min_q \mathrm{KL}\big[q(w, \Phi, Z_t)\,\|\,q_t(w, \Phi)\,p_0(Z_t)\,p(X_t \mid \Phi, Z_t)\big] + 2c \sum_{d \in B_t} \ell_\epsilon(q(w); z_d, y_d)$  (12)

An efficient algorithm for online MedLDA is derived.
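For reference, a short sketch of the generative process listed above, with toy dimensions chosen arbitrarily; it does not implement the MedLDA inference itself.

```python
import numpy as np

rng = np.random.default_rng(2)
K, W, D, N = 5, 50, 3, 20           # topics, vocabulary size, documents, words per document
alpha, gamma = 0.1, 0.01

phi = rng.dirichlet(np.full(W, gamma), size=K)      # phi_k ~ Dir(gamma): topic-word distributions

docs = []
for d in range(D):
    theta = rng.dirichlet(np.full(K, alpha))        # theta_d ~ Dir(alpha): per-document topic mix
    z = rng.choice(K, size=N, p=theta)              # z_di ~ Mult(theta_d): topic assignments
    words = [rng.choice(W, p=phi[k]) for k in z]    # x_di ~ Mult(phi_{z_di}): observed words
    docs.append(words)

print(docs[0])
```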

Latent Dirichlet Allocation Model (Some Details)

The pseudo-likelihood is

  $\psi(y_d \mid z_d, w) = \exp\!\big(-2c\,(\zeta_d)_+\big), \qquad \zeta_d = \epsilon - y_d\, f(w, z_d)$

A data augmentation scheme is used (Zhu et al., 2013), where

  $\psi(y_d \mid z_d, w) = \int_0^\infty \psi(y_d, \lambda_d \mid z_d, w)\, d\lambda_d, \qquad \psi(y_d, \lambda_d \mid z_d, w) = (2\pi\lambda_d)^{-1/2} \exp\!\left(-\frac{(\lambda_d + c\,\zeta_d)^2}{2\lambda_d}\right)$

The following optimization problem may now be solved with iterative updates:

  $\min_{q \in \mathcal{P}} \mathcal{L}(q(w, \Phi, Z_t, \lambda_t)) - \mathbb{E}_q[\log \psi(y_t, \lambda_t \mid Z_t, w)]$  (13)
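The augmentation identity can be checked numerically: integrating $\psi(y_d, \lambda_d \mid z_d, w)$ over $\lambda_d$ should recover $\exp(-2c\,(\zeta_d)_+)$. The quadrature sketch below (assuming SciPy is available) does this for a few values of $\zeta_d$.

```python
import numpy as np
from scipy.integrate import quad

def augmented_density(lam, zeta, c):
    """Integrand (2*pi*lam)^(-1/2) * exp(-(lam + c*zeta)^2 / (2*lam))."""
    if lam <= 0.0:
        return 0.0                   # the integrand vanishes as lam -> 0+
    return (2 * np.pi * lam) ** -0.5 * np.exp(-(lam + c * zeta) ** 2 / (2 * lam))

c = 1.0
for zeta in [-0.5, 0.3, 1.7]:
    integral, _ = quad(augmented_density, 0.0, np.inf, args=(zeta, c))
    target = np.exp(-2.0 * c * max(0.0, zeta))     # exp(-2c * (zeta)_+)
    print(f"zeta={zeta:+.1f}  integral={integral:.6f}  target={target:.6f}")  # columns agree
```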

Hierarchical Dirichlet Process

The authors extend the approach to a nonparametric latent variable model. The HDP has an alternative stick-breaking construction of the topic mixing proportions (Teh et al., 2006; Wang & Blei, 2012):

  $\theta_d \sim \mathrm{Dir}(\alpha\pi), \qquad \pi_k = \bar{\pi}_k \prod_{i<k} (1 - \bar{\pi}_i), \qquad \bar{\pi}_k \sim \mathrm{Beta}(1, \gamma)$  (14)

The authors propose a solution to

  $\min_{q \in \mathcal{P}} \mathcal{L}(q(w, \pi, \Phi, H_t)) - \mathbb{E}_q[\log \psi(y_t, \lambda_t \mid Z_t, w)]$  (15)

Details may be found in the paper.
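A short sketch of the stick-breaking construction in (14), using a finite truncation level; the truncation and all parameter values are assumptions made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
gamma, alpha, K_trunc = 1.0, 0.5, 50        # K_trunc: finite truncation level (assumption)

nu = rng.beta(1.0, gamma, size=K_trunc)     # pi_bar_k ~ Beta(1, gamma)
pi = nu * np.cumprod(np.concatenate(([1.0], 1.0 - nu[:-1])))   # pi_k = pi_bar_k * prod_{i<k}(1 - pi_bar_i)
pi /= pi.sum()                              # renormalize the truncated weights

theta_d = rng.dirichlet(alpha * pi)         # theta_d ~ Dir(alpha * pi)
print(pi[:5], theta_d[:5])
```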

Experiment

Multi-class classification on the 20 Newsgroups dataset: 11,269 training documents and 7,505 test documents, with a one-vs-all strategy for multi-class classification.

Comparisons:
- Passive-Aggressive algorithms pamedlda and pamedhdp
- Batch counterparts bmedlda and bmedhdp
- Sparse inference for LDA (splda)
- Truncation-free online variational HDP (tfhdp)

MED parameters: ε = 164, c = 1

Error vs. Passes Through Data

Accuracy and Training Time

Error vs. Training Time