Putting the Bayes update to sleep


Putting the Bayes update to sleep
Manfred Warmuth, UCSC
AMS seminar, 4-13-15
Joint work with Wouter M. Koolen, Dmitry Adamskiy, Olivier Bousquet

Menu
How adding one line of code to the multiplicative update imbues the algorithm with a long-term memory and makes it robust for data that shifts between a small number of modes.

Outline
- The disk spin-down problem
- The Mixing Past Posteriors update: long-term versus short-term memory
- Bayes in the tube
- The curse of the multiplicative update
- The two-track update: another update with a long-term memory
- Theoretical justification: putting Bayes to sleep
- Open problems

Disk spin-down problem
Experts: a set of n fixed timeouts τ_1, τ_2, ..., τ_n.
The master algorithm maintains a set of weights / shares s_1, s_2, ..., s_n.
Multiplicative update:
s_{t+1,i} = s_{t,i} e^{−η · (energy usage of timeout i)} / Z
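To make the update concrete, here is a minimal Python sketch of this multiplicative update; the timeout losses and the value of η are made-up placeholders, not numbers from the talk.

```python
import numpy as np

def multiplicative_update(shares, losses, eta=1.0):
    """One round of the multiplicative (exponentiated-loss) update.

    shares: current share vector over the n timeouts (sums to 1)
    losses: per-timeout loss this round, e.g. energy used by each timeout
    """
    new = shares * np.exp(-eta * losses)
    return new / new.sum()          # division by the normalizer Z

# Hypothetical example: 4 fixed timeouts and made-up energy losses
shares = np.full(4, 0.25)
losses = np.array([0.8, 0.2, 0.5, 0.9])
shares = multiplicative_update(shares, losses)
```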

A simple way of making the algorithm robust
Mix in a little bit of the uniform vector (the initial prior); this has a Bayesian interpretation.
s ← Multiplicative Update
s ← (1 − α) s + α (1/N, 1/N, ..., 1/N), where α is small
Called Fixed Share to Uniform Past.
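A minimal sketch of the Fixed Share step just described, assuming a generic loss vector per round; the function and parameter names are illustrative.

```python
import numpy as np

def fixed_share(shares, losses, eta=1.0, alpha=0.01):
    """Multiplicative update followed by mixing in the uniform vector."""
    s = shares * np.exp(-eta * losses)
    s /= s.sum()                          # multiplicative update
    n = len(s)
    return (1 - alpha) * s + alpha / n    # mix in a little of the uniform prior
```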

A better way of making the algorithm robust
Keep track of the past average share vector r.
s ← Multiplicative Update
s ← (1 − α) s + α r
Called the Mixing Past Posteriors update (with the uniform distribution on the past).
Facilitates switching back to a previously useful vector: long-term memory. No obvious Bayesian interpretation.
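A sketch of the Mixing Past Posteriors step with the uniform distribution on the past. Storing all past posteriors in a list is just one illustrative implementation, and whether the current posterior is included in the past average is a detail that varies between formulations.

```python
import numpy as np

def mixing_past_posteriors(shares, losses, past_posteriors, eta=1.0, alpha=0.01):
    """Multiplicative update, then mix in the average of past posteriors."""
    s = shares * np.exp(-eta * losses)
    s /= s.sum()                              # multiplicative update
    past_posteriors.append(s.copy())          # record for future rounds
    r = np.mean(past_posteriors, axis=0)      # past average share vector r
    return (1 - alpha) * s + alpha * r
```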

Total loss as a function of time
[Figure: total loss over time; the best model shifts through the segments A, D, A, D, G, A, D.]

The weights of the models over time
[Figure: log weights of models A, D, G, and the others over trials 0–1400; the best model shifts through the segments A, D, A, D, G, A, D.]

The family of online learning algorithms
Bayes: homes in fast; simple, elegant; the baseline.
Fixed Share: + adaptivity; simple, elegant; good performance.
Mixing Past Posteriors: + sparsity; a self-referential mess; great performance.

Our breakthrough: a Bayesian interpretation of Mixing Past Posteriors.

Quick lookahead! The trick is to put some models to sleep: asleep models don't predict, and their weights are not updated.
Later: the Gaussian mixture example.

The updates in Bayesian notation

Fixed Share:
s_m^temp = P(y | m) s_m^old / Σ_j P(y | j) s_j^old      (Bayes posterior)
s_m^new = (1 − α) s_m^temp + α · uniform
This is Bayes over partition models.

Mixing Past Posteriors:
s_m^temp = P(y | m) s_m^old / Σ_j P(y | j) s_j^old      (Bayes posterior)
s_m^new = (1 − α) s_m^temp + α · (past average of the s^temp weights)
A self-referential mess. We will give a Bayesian interpretation using the sleeping trick.

Multiplicative updates
Blessing: speed. Curse: loss of diversity.
How do we maintain variety when multiplicative updates are used?
Next: an implementation of Bayes in the tube, towards a rudimentary model of evolution.

In-vitro selection algorithm, for finding an RNA strand that binds to a protein
Make a tube of random RNA strands and a protein sheet.
for t = 1 to 20 do
  1. Dip the sheet with the protein into the tube
  2. Pull it out and wash off the RNA that stuck
  3. Multiply the washed-off RNA back to the original amount (normalization)
This can't be done with a computer because there are about 10^15 RNA strands per liter of soup.

Basic scheme
Start with a unit amount of random RNA.
Loop:
  1. Functional separation into good RNA and bad RNA
  2. Amplify the good RNA to unit amount

Duplicating DNA with PCR (needed for the normalization step)
Invented by Kary Mullis, 1985.
- Heat: the double-stranded DNA comes apart
- Cool: short primers hybridize at the ends
- Taq polymerase runs along the DNA strand and complements the bases
- Back to double-stranded DNA
Ideally, all DNA strands are multiplied by a factor of 2.

The caveat
Not many interesting/specific functional separations have been found. High throughput is needed for the functional separation.

Mathematical description of in-vitro selection
Name the unknown RNA strands 1, 2, ..., n, where n ≈ 10^15 (the number of strands in 1 liter).
s_i ≥ 0 is the share of RNA i.
The contents of the tube are represented as a share vector s = (s_1, s_2, ..., s_n). When the tube holds a unit amount, s_1 + s_2 + ... + s_n = 1.
W_i ∈ [0, 1] is the fraction of one unit of RNA i that is good; W_i is the fitness of RNA strand i for attaching to the protein. The protein is represented as a fitness vector W = (W_1, W_2, ..., W_n).
The vectors s and W are unknown: blind computation!
Strong assumption: the fitness W_i is independent of the share vector s.

Update
Good RNA in tube s: s_1 W_1 + s_2 W_2 + ... + s_n W_n = s · W.
Amplification: the good share of RNA i is s_i W_i, multiplied by a factor F. If the amplification is precise, then all good RNA is multiplied by the same factor F, and the final tube at the end of the loop is F s · W. Since the final tube holds a unit amount of RNA, F s · W = 1 and F = 1 / (s · W).
Update in each loop: s_i := s_i W_i / (s · W).

Implementing Bayes' rule in RNA
s_i is the prior P(i).
W_i ∈ [0, 1] is the probability P(Good | i).
s · W is the probability P(Good).
The update s_i := s_i W_i / (s · W) is Bayes' rule: the posterior P(i | Good) = P(i) P(Good | i) / P(Good).
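The whole loop in this notation is a one-line Bayes update; a sketch using the example shares and fitnesses from the next slide:

```python
import numpy as np

def selection_round(s, W):
    """s_i := s_i * W_i / (s . W)  --  Bayes' rule P(i | Good)."""
    good = s * W              # good amount of each strand after the wash
    return good / good.sum()  # amplify back to unit amount (divide by s . W)

s = np.array([0.01, 0.39, 0.60])   # shares from the slides' example
W = np.array([0.90, 0.75, 0.80])   # fitnesses from the slides' example
s = selection_round(s, W)
```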

Iterating Bayes' rule with the same data likelihoods
[Figure: shares s_1, s_2, s_3 over time t, with initial s = (.01, .39, .6) and W = (.9, .75, .8).]

max_i W_i eventually wins
[Figure: shares over time; the largest share is always increasing, the smallest always decreasing.]
s_{t,i} = s_{0,i} W_i^t / normalization, where t acts as a speed parameter.
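Iterating that closed form shows the curse numerically; a small sketch with the same example numbers:

```python
import numpy as np

s0 = np.array([0.01, 0.39, 0.60])
W = np.array([0.90, 0.75, 0.80])

for t in (1, 10, 50, 200):
    st = s0 * W**t             # s_{t,i} = s_{0,i} * W_i^t, up to normalization
    st /= st.sum()
    print(t, np.round(st, 4))  # the share of strand 1 (W = 0.9) tends to 1
```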

The best are too good
Multiplicative update: s_i ← s_i · (fitness factor of i), where the fitness factor is ≥ 0.
Blessing: the best get amplified exponentially fast; this works with 10^15 variables.
Curse: the best wipe out the others; loss of diversity.

View of evolution
Simple view: inheritance, mutation, and selection of the fittest with a multiplicative update.
New view: multiplicative updates are brittle; a mechanism is needed to ameliorate the curse of the multiplicative update. Machine learning and nature use various mechanisms. Here we focus on the sleeping trick.

Long-term memory with the 2-track update
Multiplicative update + a small soup exchange gives long-term memory.
[Diagram: an Awake track (selection + PCR) and an Asleep track (pass through). Each round, a fraction α of the awake soup moves to the asleep track (1 − α stays), and a fraction β of the asleep soup moves back to the awake track (1 − β stays); γ = β / (α + β).]

2 sets of weights
Awake weights a, asleep weights s.
First update: a^m = multiplicative update of a; s^m = s (pass through).
Second update: ã = (1 − α) a^m + β s^m;  s̃ = α a^m + (1 − β) s^m.
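A sketch of this two-track update; the renormalization of the awake track so that its total weight is unchanged follows the in-vitro description a few slides below, and the loss vector and parameter values are illustrative.

```python
import numpy as np

def two_track_update(a, s, losses, eta=1.0, alpha=0.01, beta=0.01):
    """Awake weights a get the multiplicative update; asleep weights s pass through.
    Then a small fraction of soup is exchanged between the two tracks."""
    am = a * np.exp(-eta * losses)
    am *= a.sum() / am.sum()        # keep the awake track's total weight unchanged
    sm = s                          # pass through
    a_new = (1 - alpha) * am + beta * sm
    s_new = alpha * am + (1 - beta) * sm
    return a_new, s_new
```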

Total loss as a function of time
[Figure: total loss over time; the best model shifts through the segments A, D, A, D, G, A, D.]

The weights a of the awake models over time
[Figure: log awake weights of models A, D, G, and the others over trials 0–1400; segments A, D, A, D, G, A, D.]

The weights s of the asleep models over time
[Figure: log asleep weights of models A, D, G, and the others over trials 0–1400; segments A, D, A, D, G, A, D.]

Bayesian interpretation
Point of departure: we are trying to predict a sequence y_1, y_2, ....
Definition: a model issues a prediction each round. We denote the prediction of model m in round t by P(y_t | y_{<t}, m).
Say we have several models m = 1, ..., M. How do we combine their predictions?

Bayesian answer
Place a prior P(m) on the models. The Bayesian predictive distribution is
P(y_t | y_{<t}) = Σ_{m=1}^M P(y_t | y_{<t}, m) P(m | y_{<t}),
where the posterior distribution is incrementally updated by
P(m | y_{≤t}) = P(y_t | y_{<t}, m) P(m | y_{<t}) / P(y_t | y_{<t}).
Bayes is fast: it predicts in O(M) time per round.
Bayes is good: the regret w.r.t. model m on data y_{≤T} is bounded by
Σ_{t=1}^T ( −ln P(y_t | y_{<t}) − (−ln P(y_t | y_{<t}, m)) ) ≤ −ln P(m).
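A sketch of one round of this Bayesian mixture; how each model computes its predictive probability P(y_t | y_{<t}, m) is left abstract here.

```python
import numpy as np

def bayes_step(prior, likelihoods):
    """prior: P(m | y_<t); likelihoods: P(y_t | y_<t, m) for each model m.
    Returns the predictive probability P(y_t | y_<t) and the new posterior."""
    predictive = np.dot(likelihoods, prior)
    posterior = likelihoods * prior / predictive
    return predictive, posterior
```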

Specialists
Definition (FSSW97, CV09): a specialist may or may not issue a prediction. The prediction P(y_t | y_{<t}, m) is only available for the awake specialists m ∈ W_t.
Say we have several specialists m = 1, ..., M. How do we combine their predictions?
If we imagine a prediction for the sleeping specialists, we can use Bayes.
Choice: the sleeping specialists predict with the Bayesian predictive distribution itself.
That sounds circular. Because it is.

Bayes for specialists
Place a prior P(m) on the models. The Bayesian predictive distribution
P(y_t | y_{<t}) = Σ_{m ∈ W_t} P(y_t | y_{<t}, m) P(m | y_{<t}) + Σ_{m ∉ W_t} P(y_t | y_{<t}) P(m | y_{<t})
has the solution
P(y_t | y_{<t}) = Σ_{m ∈ W_t} P(y_t | y_{<t}, m) P(m | y_{<t}) / Σ_{m ∈ W_t} P(m | y_{<t}).
The posterior distribution is incrementally updated by
P(m | y_{≤t}) = P(y_t | y_{<t}, m) P(m | y_{<t}) / P(y_t | y_{<t})   if m ∈ W_t,
P(m | y_{≤t}) = P(y_t | y_{<t}) P(m | y_{<t}) / P(y_t | y_{<t}) = P(m | y_{<t})   if m ∉ W_t.
Bayes is fast: it predicts in O(M) time per round.
Bayes is good: the regret w.r.t. model m on data y_{≤T} is bounded by
Σ_{t : 1 ≤ t ≤ T, m ∈ W_t} ( −ln P(y_t | y_{<t}) + ln P(y_t | y_{<t}, m) ) ≤ −ln P(m).
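A sketch of the specialist version of the update, with W_t given as a boolean mask; asleep models keep their posterior weight, exactly as in the formulas above.

```python
import numpy as np

def specialist_bayes_step(prior, likelihoods, awake):
    """prior: P(m | y_<t); likelihoods: P(y_t | y_<t, m), only used where awake;
    awake: boolean array marking the specialists m in W_t."""
    awake_mass = prior[awake].sum()
    predictive = np.dot(likelihoods[awake], prior[awake]) / awake_mass
    posterior = prior.copy()                     # asleep models keep their weight
    posterior[awake] = likelihoods[awake] * prior[awake] / predictive
    return predictive, posterior
```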

Bayes for specialists with awake vector
a(m) is the percentage of awakeness of model m. The Bayesian predictive distribution
P(y_t | y_{<t}) = Σ_m ( a(m) P(y_t | y_{<t}, m) + (1 − a(m)) P(y_t | y_{<t}) ) P(m | y_{<t})
has the solution
P(y_t | y_{<t}) = Σ_m a(m) P(y_t | y_{<t}, m) P(m | y_{<t}) / Σ_m a(m) P(m | y_{<t}).
The posterior distribution is incrementally updated by
P(m | y_{≤t}) = ( a(m) P(y_t | y_{<t}, m) + (1 − a(m)) P(y_t | y_{<t}) ) P(m | y_{<t}) / P(y_t | y_{<t})
            = ( a(m) P(y_t | y_{<t}, m) / P(y_t | y_{<t}) + (1 − a(m)) ) P(m | y_{<t}).

With awake vector, continued
How is this done with in-vitro selection?
- Split P(m | y_{<t}) into an awake part a(m) P(m | y_{<t}) and an asleep part (1 − a(m)) P(m | y_{<t})
- Do the multiplicative update on the awake part only, so that the total weight of that part does not change
- Recombine (the total weight of the posterior is one)
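The split / update / recombine recipe as a sketch with a fractional awake vector a(m):

```python
import numpy as np

def awake_vector_bayes_step(prior, likelihoods, a):
    """a: fraction of each model that is awake, entries in [0, 1]."""
    awake_part = a * prior                   # awake part of the posterior weight
    asleep_part = (1 - a) * prior            # asleep part: passes through untouched
    predictive = np.dot(likelihoods, awake_part) / awake_part.sum()
    awake_part = likelihoods * awake_part / predictive   # Bayes on the awake part only;
                                                         # its total weight is unchanged
    return predictive, awake_part + asleep_part          # recombine (sums to one)
```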

Neural nets
ŷ = σ( ((a ∘ w) · x) / (a · w) )
w̃ = (a ∘ w ∘ exp(−η ℓ)) / Z   [awake part]   +   (1 − a) ∘ w   [asleep part],
where Z = (a · (w ∘ exp(−η ℓ))) / (a · w), so that ||w̃||_1 = 1.
When all are awake (a = 1) the update becomes w̃ = (w ∘ exp(−η ℓ)) / (w · exp(−η ℓ)).
When all are asleep (a = 0) the update becomes w̃ = w; however, the prediction does not make sense in this case.
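A sketch of this sleeping-style weight update under the reconstruction above; the component-wise losses ℓ, the awakeness vector a, and η are illustrative inputs.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(w, a, x):
    """Prediction uses only the awake fraction of the weights, renormalized."""
    return sigmoid(np.dot(a * w, x) / np.dot(a, w))

def sleeping_update(w, a, losses, eta=1.0):
    """Each weight is only a(i)-awake; the awake part gets the multiplicative
    update and is rescaled so that its total mass a . w is preserved."""
    awake = a * w * np.exp(-eta * losses)
    awake *= np.dot(a, w) / awake.sum()   # division by Z preserves the awake mass
    return awake + (1 - a) * w            # asleep part passes through; sums to 1
```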

The specialists trick
Is the specialists / sleeping trick just a curiosity?
The specialists trick:
- Start with the input models
- Create many derived virtual specialists
- Control how they predict
- Control when they are awake
- Run Bayes on all these specialists
- Carefully choose the prior
Result: low regret w.r.t. a class of intricate combinations of the input models, and fast execution (Bayes on the specialists collapses).

Application 1: Freund's problem
In practice: a large number M of models; some are good some of the time; most are bad all the time. We do not know in advance which models will be useful when.
Goal: compare favourably on data y_{≤T} with the best alternation of N ≪ M models with B ≪ T blocks.
[Diagram: T outcomes labelled A B C A C B C A — a partition model with N = 3 good models and B = 8 blocks]
Bayesian reflex: average all such partition models. NP-hard.

Application 1: sleeping solution
Create partition specialists.
[Diagram: over the T outcomes, each specialist sleeps (zzz...) outside its own segments and predicts with its model (A A A, B B, C C C) inside them]
Bayes is fast: O(M) time per trial, O(M) space.
Bayes is good: regret close to the information-theoretic lower bound.
This gives a Bayesian interpretation for the Mixing Past Posteriors and 2-track updates, a slightly improved regret bound, and an explanation of the mysterious factor 2.

Application 2: online multitask learning
In practice: a large number M of models; data y_1, y_2, ... from K interleaved tasks; we observe the task label κ_1, κ_2, ... before predicting. We do not know in advance which tasks are similar.
Goal: compare favourably to the best clustering of the tasks with N ≪ K cells.
[Diagram: K tasks labelled A B A C B A — a cluster model with N = 3 cells]
Bayesian reflex: average all cluster models. NP-hard.

Application 2: sleeping solution
Create cluster specialists.
[Diagram: over the K tasks, each cluster specialist sleeps (zzz...) on the tasks outside its cluster]
Bayes is fast: O(M) time per trial, O(MK) space.
Bayes is good: regret close to the information-theoretic lower bound.
An intriguing algorithm: the regret is independent of the task switch count.

Pictorial summary of the Bayesian interpretation
[Diagram, over T outcomes:
Bayes — a fixed expert (A throughout)
Fixed Share — partition models (A B C A C B C A)
Mixing Past Posteriors / 2-track — partition specialists, asleep (zzz...) outside their own segments (A A A, C C C, B B)]

Conclusion
The specialists trick:
- avoids the NP-hard offline problem
- ditches Bayes on coordinated comparator models; instead it creates uncoordinated virtual specialists that cover the time line
- crafts the prior so that Bayes collapses (efficient emulation)
- achieves small regret

Open problems
- Is the sleeping / specialist trick necessary?
- How is it related to the Chinese Restaurant Process?
- Are 3 tracks better than 2 tracks for some applications? Does nature use 3 or more tracks?
- Is there a practical update for shifting Gaussians?