Putting the Bayes update to sleep
Manfred Warmuth (UCSC), AMS seminar, 4-13-15
Joint work with Wouter M. Koolen, Dmitry Adamskiy, Olivier Bousquet
Menu
How adding one line of code to the multiplicative update imbues the algorithm with a long-term memory and makes it robust for data that shifts between a small number of modes.

Outline
- The disk spin-down problem
- The Mixing Past Posteriors update: long-term versus short-term memory
- Bayes in the tube
- The curse of the multiplicative update
- The two-track update: another update with a long-term memory
- Theoretical justification: putting Bayes to sleep
- Open problems
Disk spindown problem
Experts: a set of n fixed timeouts τ_1, τ_2, ..., τ_n
Master algorithm: maintains a set of weights / shares s_1, s_2, ..., s_n
Multiplicative update:
  s_{t+1,i} = s_{t,i} e^{-η (energy usage of timeout i)} / Z
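As a sanity check, this update can be sketched in a few lines of Python (the per-trial losses and the learning rate η below are illustrative, not from the talk):

```python
import numpy as np

def multiplicative_update(shares, losses, eta=0.5):
    """One round of s_{t+1,i} = s_{t,i} * exp(-eta * loss_i) / Z."""
    s = shares * np.exp(-eta * losses)
    return s / s.sum()  # Z renormalizes the shares to sum to 1

# Four candidate timeouts with made-up energy losses for one trial.
shares = np.full(4, 0.25)                  # uniform initial shares
losses = np.array([1.0, 0.2, 0.5, 0.9])    # energy usage of each timeout
shares = multiplicative_update(shares, losses)
```

The share of the low-loss timeout grows relative to the others while the vector stays normalized.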
A simple way of making the algorithm robust
Mix in a little bit of the uniform vector (the initial prior):
  s = Multiplicative Update(s)
  s = (1 - α) s + α (1/N, 1/N, ..., 1/N),  where α is small
Has a Bayesian interpretation.
Called Fixed Share (to Uniform Past).
A better way of making the algorithm robust
Keep track of the past average share vector r:
  s = Multiplicative Update(s)
  s = (1 - α) s + α r
Called the Mixing Past Posteriors update (with the uniform distribution on the past).
Facilitates switching back to a previously useful vector: long-term memory.
No obvious Bayesian interpretation.
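The two mixing schemes differ only in what they mix in; a minimal sketch (α and the example vectors are illustrative):

```python
import numpy as np

def fixed_share(s, alpha=0.01):
    """Mix the share vector with the uniform (initial prior) vector."""
    n = len(s)
    return (1 - alpha) * s + alpha * np.full(n, 1.0 / n)

def mixing_past_posteriors(s, past_shares, alpha=0.01):
    """Mix with the average of the past share vectors instead,
    which lets the algorithm jump back to previously useful weights."""
    r = np.mean(past_shares, axis=0)  # uniform distribution on the past
    return (1 - alpha) * s + alpha * r

s = np.array([0.7, 0.2, 0.1])                                  # current shares
past = [np.array([0.1, 0.8, 0.1]), np.array([0.2, 0.6, 0.2])]  # past shares
s_fs = fixed_share(s)
s_mpp = mixing_past_posteriors(s, past)
```

Fixed Share only protects against arbitrary shifts; Mixing Past Posteriors also remembers which experts were good before.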
[Figure: total loss as a function of time; the best model switches through the segments A, D, A, D, G, A, D.]
[Figure: log weights of the models over trials 0-1400; models A, D, G stand out from the others as the data switches through the segments A, D, A, D, G, A, D.]
The family of online learning algorithms
- Bayes: homes in fast; simple, elegant; the baseline.
- Fixed Share: + adaptivity; simple, elegant; good performance.
- Mixing Past Posteriors: + sparsity; a self-referential mess; great performance.
Our breakthrough
A Bayesian interpretation of Mixing Past Posteriors.
Quick lookahead!
The trick is to put some models to sleep:
- Asleep models don't predict
- Their weights are not updated
Later: the Gaussian mixture example.
The updates in Bayesian notation

Fixed Share:
  s_m^temp = P(y | m) s_m^old / Σ_{m'} P(y | m') s_{m'}^old    (Bayes posterior)
  s_m^new = (1 - α) s_m^temp + α · uniform    (Bayes over partition models)

Mixing Past Posteriors:
  s_m^temp = P(y | m) s_m^old / Σ_{m'} P(y | m') s_{m'}^old    (Bayes posterior)
  s_m^new = (1 - α) s_m^temp + α · (past average of the s^temp weights)

A self-referential mess. We will give a Bayesian interpretation using the sleeping trick.
Multiplicative updates
Blessing: speed. Curse: loss of diversity.
How to maintain variety when multiplicative updates are used?
Next: an implementation of Bayes in the tube, towards a rudimentary model of evolution.
In-vitro selection algorithm: for finding an RNA strand that binds to a protein
Make a tube of random RNA strands and a protein sheet.
for t = 1 to 20 do
  1. Dip the sheet with the protein into the tube
  2. Pull out and wash off the RNA that stuck
  3. Multiply the washed-off RNA back to the original amount (normalization)
Can't be done on a computer, because there are about 10^15 RNA strands per liter of soup.
Basic scheme
Start with a unit amount of random RNA.
Loop:
  1. Functional separation into good RNA and bad RNA
  2. Amplify the good RNA to unit amount
Duplicating DNA with PCR (needed for the normalization step)
Invented by Kary Mullis, 1985.
- Heat: the double-stranded DNA comes apart
- Cool: short primers hybridize at the ends
- Taq polymerase runs along each DNA strand and complements the bases
- Back to double-stranded DNA
Ideally all DNA strands are multiplied by a factor of 2.
The caveat
Not many interesting/specific functional separations have been found.
Need high throughput for the functional separation.
Mathematical description of in-vitro selection
- Name the unknown RNA strands 1, 2, ..., n, where n ≈ 10^15 (the number of strands in 1 liter)
- s_i ≥ 0 is the share of RNA i
- The contents of the tube are represented as a share vector s = (s_1, s_2, ..., s_n); when the tube holds a unit amount, s_1 + s_2 + ... + s_n = 1
- W_i ∈ [0, 1] is the fraction of one unit of RNA i that is good; W_i is the fitness of RNA strand i for attaching to the protein
- The protein is represented as a fitness vector W = (W_1, W_2, ..., W_n)
- The vectors s, W are unknown: blind computation!
- Strong assumption: the fitness W_i is independent of the share vector s
Update
Good RNA in tube s: s_1 W_1 + s_2 W_2 + ... + s_n W_n = s · W
Amplification: the good share s_i W_i of RNA i is multiplied by a factor F. If the amplification is precise, all good RNA is multiplied by the same factor F.
The final tube at the end of the loop is F (s_1 W_1, ..., s_n W_n). Since the final tube holds a unit amount of RNA, F (s · W) = 1 and F = 1 / (s · W).
Update in each loop:
  s_i := s_i W_i / (s · W)
Implementing Bayes' rule in RNA
- s_i is the prior P(i)
- W_i ∈ [0, 1] is the probability P(Good | i)
- s · W is the probability P(Good)
- s_i := s_i W_i / (s · W) is Bayes' rule:
    P(i | Good) = P(i) P(Good | i) / P(Good)    (posterior = prior × likelihood / evidence)
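Iterating this rule with the same likelihoods (the vectors s = (.01, .39, .6) and W = (.9, .75, .8) used in the figure that follows) shows the winner-take-all behavior discussed on the next slides; a sketch:

```python
import numpy as np

def selection_round(s, W):
    """One dip-wash-amplify round: s_i := s_i * W_i / (s . W),
    i.e. Bayes' rule with prior s and likelihood W."""
    return s * W / np.dot(s, W)

s = np.array([0.01, 0.39, 0.60])   # initial shares
W = np.array([0.90, 0.75, 0.80])   # fitness of each strand
for t in range(200):
    s = selection_round(s, W)
```

Even though the first strand starts with a share of only .01, its higher fitness lets it take over the whole tube: the curse of the multiplicative update.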
[Figure: iterating Bayes' rule with the same data. Shares s_1, s_2, s_3 over time t, starting from s = (.01, .39, .6) with likelihoods W = (.9, .75, .8).]
[Figure: max_i W_i eventually wins. Shares for W values .9, .85, .84, .7 over time; the largest share is always increasing, the smallest always decreasing.]
  s_{t,i} = s_{0,i} W_i^t / normalization
t acts as a speed parameter.
The best are too good
Multiplicative update: s_i ← s_i · (fitness factor)_i, with fitness factors ≥ 0.
Blessing: the best get amplified exponentially fast; this works with 10^15 variables.
Curse: the best wipe out the others. Loss of diversity.
View of evolution
Simple view: inheritance, mutation, selection of the fittest, with a multiplicative update.
New view: multiplicative updates are brittle; a mechanism is needed to ameliorate the curse of the multiplicative update. Machine learning and nature use various mechanisms. Here we focus on the sleeping trick.
Long-term memory with the 2-track update
Multiplicative update + a small soup exchange gives long-term memory.
[Diagram: two tracks. The awake track undergoes Selection+PCR; the asleep track just passes through. Per round, a fraction α of the awake soup moves to the asleep track and a fraction β of the asleep soup moves back (stay probabilities 1 - α and 1 - β); the initial split is γ awake, 1 - γ asleep, with γ = β / (α + β).]
2 sets of weights
Awake weights a, asleep weights s.
1st update:
  a_m = multiplicative update of a_m
  s_m = s_m    (i.e. pass-through: asleep weights are not updated)
2nd update:
  ã = (1 - α) a + β s
  s̃ = α a + (1 - β) s
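A sketch of one round of the 2-track update, assuming e^{-η·loss} as the multiplicative factor and illustrative values of α, β, η:

```python
import numpy as np

def two_track_update(a, s, losses, eta=0.5, alpha=0.01, beta=0.01):
    """Awake weights a get the multiplicative update (selection + PCR),
    with their total mass held fixed; asleep weights s pass through.
    Then a small 'soup exchange' couples the two tracks."""
    a_upd = a * np.exp(-eta * losses)
    a_upd *= a.sum() / a_upd.sum()        # renormalize the awake mass only
    a_new = (1 - alpha) * a_upd + beta * s
    s_new = alpha * a_upd + (1 - beta) * s
    return a_new, s_new

a = np.array([0.30, 0.20])     # awake weights
s = np.array([0.25, 0.25])     # asleep weights; a and s together sum to 1
losses = np.array([0.1, 0.9])
a, s = two_track_update(a, s, losses)
```

The asleep track changes slowly, so the weights of previously good models survive there and can be revived quickly: long-term memory.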
[Figure: total loss as a function of time; the best model switches through the segments A, D, A, D, G, A, D.]
[Figure: log awake weights a of the models over trials 0-1400; models A, D, G versus the others as the data switches through A, D, A, D, G, A, D.]
[Figure: log asleep weights s of the models over trials 0-1400; models A, D, G versus the others as the data switches through A, D, A, D, G, A, D.]
Bayesian interpretation
Point of departure: we are trying to predict a sequence y_1, y_2, ....
Definition: a model issues a prediction each round. We denote the prediction of model m in round t by P(y_t | y_<t, m).
Say we have several models m = 1, ..., M. How do we combine their predictions?
Bayesian answer
Place a prior P(m) on the models. The Bayesian predictive distribution is
  P(y_t | y_<t) = Σ_{m=1}^{M} P(y_t | y_<t, m) P(m | y_<t)
where the posterior distribution is incrementally updated by
  P(m | y_≤t) = P(y_t | y_<t, m) P(m | y_<t) / P(y_t | y_<t)
Bayes is fast: predict in O(M) time per round.
Bayes is good: the regret w.r.t. model m on data y_≤T is bounded by
  Σ_{t=1}^{T} [(-ln P(y_t | y_<t)) - (-ln P(y_t | y_<t, m))] ≤ -ln P(m)
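These two formulas in code, for binary outcomes (the prior and the models' predictive probabilities are made up for illustration):

```python
import numpy as np

def bayes_step(prior, model_probs, y):
    """model_probs[m] = P(y_t = 1 | y_<t, m).  Returns the predictive
    probability P(y_t = 1 | y_<t) and the new posterior P(m | y_<=t)."""
    pred = np.dot(prior, model_probs)                   # mixture prediction
    like = model_probs if y == 1 else 1 - model_probs   # P(y_t | y_<t, m)
    posterior = like * prior / np.dot(prior, like)      # incremental Bayes
    return pred, posterior

prior = np.array([0.5, 0.5])
model_probs = np.array([0.9, 0.4])   # model 0 thinks y=1 is likely, model 1 not
pred, posterior = bayes_step(prior, model_probs, y=1)
```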
Specialists
Definition (FSSW97, CV09): a specialist may or may not issue a prediction. The prediction P(y_t | y_<t, m) is only available for the awake specialists m ∈ W_t.
Say we have several specialists m = 1, ..., M. How do we combine their predictions?
If we imagine a prediction for the sleeping specialists, we can use Bayes.
Choice: the sleeping specialists predict with the Bayesian predictive distribution itself.
That sounds circular. Because it is. $%!#?
Bayes for specialists
Place a prior P(m) on the models. The Bayesian predictive distribution must satisfy
  P(y_t | y_<t) = Σ_{m ∈ W_t} P(y_t | y_<t, m) P(m | y_<t) + Σ_{m ∉ W_t} P(y_t | y_<t) P(m | y_<t)
which has the solution
  P(y_t | y_<t) = [Σ_{m ∈ W_t} P(y_t | y_<t, m) P(m | y_<t)] / [Σ_{m ∈ W_t} P(m | y_<t)]
The posterior distribution is incrementally updated by
  P(m | y_≤t) = P(y_t | y_<t, m) P(m | y_<t) / P(y_t | y_<t)                      if m ∈ W_t
  P(m | y_≤t) = P(y_t | y_<t) P(m | y_<t) / P(y_t | y_<t) = P(m | y_<t)           if m ∉ W_t
Bayes is fast: predict in O(M) time per round.
Bayes is good: the regret w.r.t. model m on data y_≤T is bounded by
  Σ_{t : 1 ≤ t ≤ T and m ∈ W_t} [-ln P(y_t | y_<t) + ln P(y_t | y_<t, m)] ≤ -ln P(m)
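The collapsed update is easy to implement: awake specialists share a Bayes update among themselves, asleep ones keep their posterior weight. A sketch for binary outcomes (the awake set and probabilities are illustrative):

```python
import numpy as np

def specialist_step(prior, model_probs, awake, y):
    """awake: boolean mask for W_t.  Sleeping specialists 'predict' with
    the predictive distribution itself, so their weights are unchanged.
    Returns P(observed y_t | y_<t) and the new posterior."""
    like = model_probs if y == 1 else 1 - model_probs   # P(y_t | y_<t, m)
    awake_mass = prior[awake].sum()
    pred = np.dot(prior[awake], like[awake]) / awake_mass
    posterior = prior.copy()                            # asleep: unchanged
    posterior[awake] = like[awake] * prior[awake] / pred
    return pred, posterior

prior = np.array([0.4, 0.4, 0.2])
model_probs = np.array([0.9, 0.3, 0.5])
awake = np.array([True, True, False])   # specialist 2 is asleep
pred, posterior = specialist_step(prior, model_probs, awake, y=1)
```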
Bayes for specialists with an awake vector
a(m) is the degree (percentage) of awakeness. The Bayesian predictive distribution must satisfy
  P(y_t | y_<t) = Σ_m [a(m) P(y_t | y_<t, m) + (1 - a(m)) P(y_t | y_<t)] P(m | y_<t)
which has the solution
  P(y_t | y_<t) = [Σ_m a(m) P(y_t | y_<t, m) P(m | y_<t)] / [Σ_m a(m) P(m | y_<t)]
The posterior distribution is incrementally updated by
  P(m | y_≤t) = [a(m) P(y_t | y_<t, m) + (1 - a(m)) P(y_t | y_<t)] P(m | y_<t) / P(y_t | y_<t)
             = [a(m) P(y_t | y_<t, m) / P(y_t | y_<t) + (1 - a(m))] P(m | y_<t)
With an awake vector, continued
How is this done with in-vitro selection?
Split P(m | y_<t) into
- an awake part a(m) P(m | y_<t) and
- an asleep part (1 - a(m)) P(m | y_<t)
Do the multiplicative update on the awake part only, so that the total weight of that part does not change.
Recombine (the total weight of the posterior is one).
Neural nets
  ŷ = σ( (Σ_i a_i w_i x_i) / (Σ_i a_i w_i) )
  w̃_i = a_i w_i exp(-η L_i) / Z  [awake part]  +  (1 - a_i) w_i  [asleep part]
where Z = (Σ_i a_i w_i exp(-η L_i)) / (Σ_i a_i w_i), so that if Σ_i w_i = 1 then Σ_i w̃_i = 1.
When all are awake (a = 1), the update becomes
  w̃_i = w_i exp(-η L_i) / Σ_j w_j exp(-η L_j)
When all are asleep (a = 0), the update becomes w̃ = w; however, the prediction does not make sense in this case.
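A sketch of this fractionally-asleep weight update (the vectors w, a, the losses, and η are illustrative): the awake part gets the multiplicative update with its total mass held fixed, the asleep part passes through.

```python
import numpy as np

def sleeping_update(w, a, losses, eta=0.5):
    """w: weights summing to 1; a: per-weight awakeness in [0, 1].
    w_new_i = a_i w_i exp(-eta L_i) / Z + (1 - a_i) w_i, with Z chosen
    so the awake mass is preserved and w_new still sums to 1."""
    factor = np.exp(-eta * losses)
    Z = np.dot(a * w, factor) / np.dot(a, w)
    return a * w * factor / Z + (1 - a) * w

w = np.array([0.5, 0.3, 0.2])
a = np.array([1.0, 0.5, 0.0])    # fully awake, half awake, fully asleep
losses = np.array([0.2, 1.0, 0.4])
w_new = sleeping_update(w, a, losses)
```

With a = 1 everywhere this reduces to the ordinary normalized exponentiated update; with a = 0 a weight is untouched.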
The specialists trick
Is the specialists / sleeping trick just a curiosity?
The specialists trick:
- Start with the input models
- Create many derived virtual specialists
- Control how they predict and when they are awake
- Run Bayes on all these specialists, with a carefully chosen prior
Result: low regret w.r.t. a class of intricate combinations of the input models, and fast execution, because Bayes on the specialists collapses.
Application 1: Freund's problem
In practice: a large number M of models; some are good some of the time, most are bad all the time. We do not know in advance which models will be useful when.
Goal: compare favourably on data y_≤T with the best alternation of N ≪ M models with B ≪ T blocks.
  T outcomes:  A B C A C B C A    (a partition model with N = 3 good models and B = 8 blocks)
Bayesian reflex: average all such partition models. NP-hard.
Application 1: sleeping solution
Create partition specialists: each one is asleep (zzz...) outside its blocks and predicts with a single model (A A A, B B, C C C, ...) inside them.
Bayes is fast: O(M) time per trial, O(M) space.
Bayes is good: the regret is close to the information-theoretic lower bound.
This gives a Bayesian interpretation of the Mixing Past Posteriors and 2-track updates, a slightly improved regret bound, and an explanation of a mysterious factor of 2.
Application 2: online multitask learning
In practice: a large number M of models; data y_1, y_2, ... from K interleaved tasks; we observe the task label κ_1, κ_2, ... before predicting. We do not know in advance which tasks are similar.
Goal: compare favourably to the best clustering of the tasks into N ≪ K cells.
  K tasks:  A B A C B A    (a cluster model with N = 3 cells)
Bayesian reflex: average all cluster models. NP-hard.
Application 2: sleeping solution
Create cluster specialists: each one is asleep (zzz...) on the tasks outside its cluster.
Bayes is fast: O(M) time per trial, O(MK) space.
Bayes is good: the regret is close to the information-theoretic lower bound.
An intriguing algorithm: the regret is independent of the task switch count.
Pictorial summary of the Bayesian interpretation (over T outcomes)
- Bayes: a fixed expert A
- Fixed Share: partition models, e.g. A B C A C B C A
- Mixing Past Posteriors / 2-track: partition specialists (zzz... while asleep), e.g. A A A, C C C, B B
Conclusion
The specialists trick:
- Avoids an NP-hard offline problem
- Ditches Bayes on coordinated comparator models; instead creates uncoordinated virtual specialists that cover the time line
- Crafts the prior so that Bayes collapses (efficient emulation)
- Small regret
Open problems
- Is the sleeping / specialists trick necessary?
- How is it related to the Chinese restaurant process?
- Are 3 tracks better than 2 tracks for some applications? Does nature use 3 or more tracks?
- A practical update for shifting Gaussians?