The Blessing and the Curse of the Multiplicative Updates
Manfred K. Warmuth, University of California, Santa Cruz
CMPS 272, Feb 31, 2012
Thanks to David Ilstrup and Anindya Sen for helping with the slides
Manfred K. Warmuth (UCSC) The Blessing and the Curse 1 / 51
Multiplicative updates

Definition
- Algorithm maintains a weight vector
- Weights multiplied by non-negative factors
- Motivated by relative entropies as regularizer

Big Questions
- Why are multiplicative updates so prevalent in nature?
- Are additive updates used?
- Are matrix versions of multiplicative updates used?
Machine learning vs nature

Typical multiplicative update in ML
- Set of experts predict something on a trial-by-trial basis
- s_i is the belief in the ith expert at trial t
- Update [V,LW]: s_i := s_i e^(-η loss_i) / Z, where η is the learning rate and Z the normalizer
- Bayes update is a special case

Nature
- Weights are concentrations / shares in a population
- Next: implementation of Bayes in the tube
Multiplicative updates
- Blessing: Speed
- Curse: Loss of variety

Will give three machine learning methods for preventing the curse and discuss related methods used in nature
In-vitro selection algorithm for finding RNA strand that binds to protein

Make tube of random RNA strands and a protein sheet
for t = 1 to 20 do
1. Dip sheet with protein into tube
2. Pull out and wash off RNA that stuck
3. Multiply washed-off RNA back to original amount (normalization)
Modifications
- Start with modifications of a good candidate RNA strand
- Put a similar protein into solution
- Now attaching must be more specific
- Selection for a better variant

Can't be done with a computer because there are ~10^20 RNA strands per liter of soup
Basic scheme

Start with unit amount of random RNA
Loop
1. Functional separation into good RNA and bad RNA
2. Amplify good RNA to unit amount
Duplicating DNA with PCR
- Invented by Kary Mullis in 1985
- Short primers hybridize at the ends
- Polymerase runs along DNA strand and complements bases
- Ideally all DNA strands multiplied by a factor of 2
In-vitro selection algorithm for proteins
- DNA strands chemically similar
- RNA strands more varied and can form folded structures (conjectured that the first enzymes were made of RNA)
- DNA (4 letters) -> RNA -> Protein (20 letters, translated by the ribosome)
- Proteins are the materials of life
- For any material property, a protein can be found

How to select for a protein with certain functionality?
Tag proteins with their DNA
In-vitro selection for protein

Initialize tube with tagged random protein strands (tag of a protein is its RNA coding)
Loop
1. Functional separation
2. Retrieve tags from good protein
3. Reproduce tagged protein from retrieved tags with ribosome

Almost doable...
The caveat
Not many interesting/specific functional separations found
Need high throughput for functional separation
Mathematical description of in-vitro selection
- Name the unknown RNA strands 1, 2, ..., n where n ≈ 10^20 (# of strands in 1 liter)
- s_i ≥ 0 is the share of RNA i
- Contents of tube represented as share vector s = (s_1, s_2, ..., s_n)
- When tube has unit amount, then s_1 + s_2 + ... + s_n = 1
- W_i ∈ [0, 1] is the fraction of one unit of RNA i that is good
- W_i is the fitness of RNA strand i for attaching to the protein
- Protein represented as fitness vector W = (W_1, W_2, ..., W_n)
- Vectors s, W unknown: Blind Computation!
- Strong assumption: Fitness W_i independent of share vector s
Update
- Good RNA in tube s: s_1 W_1 + s_2 W_2 + ... + s_n W_n = s · W
- Bad RNA: s_1 (1 - W_1) + s_2 (1 - W_2) + ... + s_n (1 - W_n) = 1 - s · W
- Amplification: good share of RNA i is s_i W_i, multiplied by factor F
- If precise, then all good RNA multiplied by the same factor F
- Final tube at end of loop: F s · W
- Since the final tube has unit amount of RNA, F s · W = 1 and F = 1 / (s · W)

Update in each loop: s_i := s_i W_i / (s · W)
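The loop update can be simulated directly; a minimal Python sketch (the share and fitness vectors are made-up toy values):

```python
def invitro_update(s, W):
    """One selection round: keep the good fraction W_i of each share s_i,
    then amplify back to unit amount: s_i := s_i * W_i / (s . W)."""
    good = [si * wi for si, wi in zip(s, W)]  # good part of each strand
    Z = sum(good)                             # s . W, the total good RNA
    return [g / Z for g in good]

s = invitro_update([0.01, 0.39, 0.60], [0.9, 0.75, 0.8])
print(s)  # shares shift toward the strand with the largest fitness
```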
Implementing Bayes rule in RNA
- s_i is the prior P(i)
- W_i ∈ [0, 1] is the probability P(Good | i)
- s · W is the probability P(Good)
- s_i := s_i W_i / (s · W) is Bayes rule:

P(i | Good) [posterior] = P(i) [prior] · P(Good | i) / P(Good)
Iterating Bayes Rule with same data
[Plot: shares s_1, s_2, s_3 over trials t; initial s = (.01, .39, .6), likelihoods W = (.9, .75, .8)]
max_i W_i eventually wins
[Plot: shares over trials t for fitnesses .9, .85, .84, .7; largest share always increasing, smallest always decreasing]

s_{t,i} = s_{0,i} W_i^t / normalization, where t acts as a speed parameter
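Iterating the normalized update shows the winner-take-all behavior; a small simulation with the toy values from the previous slide:

```python
def iterate_shares(s, W, T):
    """Run T rounds of s_i := s_i * W_i / (s . W); the strand with the
    largest fitness W_i eventually absorbs all of the share."""
    for _ in range(T):
        s = [si * wi for si, wi in zip(s, W)]
        Z = sum(s)
        s = [si / Z for si in s]
    return s

print(iterate_shares([0.01, 0.39, 0.60], [0.9, 0.75, 0.8], 100))
```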
Negative range of t reveals sigmoids
[Plot: largest share (e.g. .5) follows a sigmoid and takes over; smallest share (e.g. .49) follows a reverse sigmoid]
The best are too good

Multiplicative update: s_i := s_i · (fitness factor)_i, with (fitness factor)_i ≥ 0
- Blessing: Best get amplified exponentially fast
- Curse: The best wipe out the others - Loss of variety
A key problem

3 strands of RNA: A 15%, B 60%, C 25%
Want to amplify mixture while keeping percentages unchanged
Iterative amplification will favor one!
(Assumption: W_i independent of shares)
Solution
- Make one long strand with right percentages of A, B, C: A B B B C A B B ...
- Amplify the same long strand many times
- Chop into constituents
- Long strand functions as chromosome
- Free-floating genes in the nucleus would compete - Loss of variety
Coupling preserves diversity
- Un-coupled: A, B --PCR--> A, A: A and B compete
- Coupled: A+B --PCR--> A+B: A and B should cooperate
Questions
- What updates to s represent an iteration of an in-vitro selection algorithm?
- What updates are possible with blind computation? Must maintain non-negative weights

Assumptions
- s independent of W
- s · W measurable

So far: s_i := s_i W_i / normalization

More tempered update, for 0 ≤ γ < 1:
s_i := γ s_i (1 - W_i) [BAD] + s_i W_i [GOOD] = s_i (γ (1 - W_i) + W_i) ≥ 0
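The tempered update keeps a γ-fraction of the bad RNA, so no strand is wiped out in a single round; a sketch (the share and fitness values are illustrative):

```python
def tempered_update(s, W, gamma):
    """Tempered selection: s_i := s_i * (gamma*(1 - W_i) + W_i), 0 <= gamma < 1,
    then renormalize. gamma = 0 recovers the pure update s_i W_i / (s . W)."""
    s = [si * (gamma * (1 - wi) + wi) for si, wi in zip(s, W)]
    Z = sum(s)
    return [si / Z for si in s]

print(tempered_update([0.2, 0.3, 0.5], [0.9, 0.5, 0.1], 0.5))
```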
More challenging goal

Find set of RNA strands that binds to a number of different proteins P_1, ..., P_5 with fitness vectors W_1, ..., W_5 (trials t = 1, ..., 5)
In trial t, functional separation based on fitness vector W_t
So far all W_t the same...
In-vitro selection algorithm

          P_1    P_2    P_3    P_4    P_5
W_{t,1}   0.9    0.8    0.96   0.2    0.04
W_{t,2}   0.1    0.01   0.8    0.9    0.8
u · W_t   0.5    0.405  0.88   0.55   0.42

Goal Tube: u = (0.5, 0.5, 0, ..., 0)
The two strands cooperatively solve the problem
Related to disjunction of two variables
Key problem 2

Goal: Starting with a 1 liter tube in which all strands appear with frequency 10^-20, use PCR and ..., to arrive at the tube (0.5, 0.5, 0, ..., 0)

Problem
- Over-train with P_1 and P_2: s_1 -> 1, s_2 -> 0
- Over-train with P_4 and P_5: s_1 -> 0, s_2 -> 1

Want blind computation! W_t and initial s unknown
Need some kind of feedback in each trial
Related machine learning problem
- Tube share vector s in the probability simplex
- Proteins: example vectors W_t ∈ {0, 1}^N
- Assumption: there is u = (0, 0, 1/k, 0, ..., 0, 1/k, 0, ..., 0) with k non-zero components of value 1/k s.t. for all t: u · W_t ≥ 1/(2k)

Goal: Find s with s · W_t ≥ 1/(2k) for all t
Normalized Winnow Algorithm (N. Littlestone)

Do passes over examples W_t
- if s_{t-1} · W_t ≥ 1/(2k) then s_t := s_{t-1} (conservative update)
- else s_{t,i} := s_{t-1,i} if W_{t,i} = 0, and α s_{t-1,i} if W_{t,i} = 1, where α > 1; then re-normalize

O(k log(N/k)) updates needed when labels consistent with a k-literal disjunction
- Logarithmic in N
- Optimal to within constant factors
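A compact sketch of Normalized Winnow under these assumptions; the example generator and toy data are made up for illustration (all examples are positive instances of a hidden k-literal disjunction):

```python
def normalized_winnow(examples, k, alpha=2.0):
    """Normalized Winnow: conservative multiplicative update on the simplex.
    Promote (by alpha) the active components only when the prediction
    s . W_t falls below the threshold 1/(2k), then renormalize."""
    N = len(examples[0])
    s = [1.0 / N] * N  # start uniform
    for W in examples:
        if sum(si * wi for si, wi in zip(s, W)) >= 1.0 / (2 * k):
            continue  # conservative update: leave s unchanged
        s = [si * (alpha if wi == 1 else 1.0) for si, wi in zip(s, W)]
        Z = sum(s)
        s = [si / Z for si in s]
    return s

def example(i, j, N=16):
    """Toy positive example: relevant literal i plus a distractor j."""
    v = [0] * N
    v[i] = v[j] = 1
    return v

data = [example(0, 2), example(1, 3), example(0, 4), example(1, 5)]
s = normalized_winnow(data * 3, k=2)  # 3 passes over the examples
```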
Back to in-vitro selection

Fix: if the good side is large enough, then don't update
- if s_{t-1} · W_t ≥ δ then s_t := s_{t-1} (conservative update)
- else s_{t,i} := s_{t-1,i} (γ_t (1 - W_{t,i}) + W_{t,i}) [non-negative factor], where 0 ≤ γ_t < 1; then re-normalize

Can simulate Normalized Winnow

Concerns
- How exactly can we measure s · W?
- Some RNA might be inactive!
Problems with in-vitro selection

Story so far
- For learning multiple goals - prevent overtraining with conservative update

Alternate trick
- Cap weights
- Start with nature
- How we used the same trick in machine learning
Alternate trick: cap weights
Super predator algorithm - preserves variety
(c) Lion copyrights belong to Andy Warhol
Alternate trick: cap weights
[Bar plot: weights before and after capping]

s~_i = s_i e^{-η Loss_i} / Z
s = capped and rescaled version of s~: cap, and re-scale the rest
Why capping?
- m sets encoded as probability vectors (0, 1/m, 0, 0, 1/m, 0, 1/m) called m-corners; there are (n choose m) of them
- The convex hull of the m-corners = capped probability simplex
[Diagram: for n = 3, m = 2 the corners are (1/2, 1/2, 0), (1/2, 0, 1/2), (0, 1/2, 1/2)]
- Combined update converges to best corner of capped simplex - best set of size m
- Capping lets us learn multiple goals

How to cap in the in-vitro selection setting?
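One way to enforce the cap is to repeatedly clip weights at 1/m and rescale the uncapped rest; a sketch of this projection onto the capped simplex (assuming the input already sums to 1):

```python
def cap_and_rescale(s, m, tol=1e-12):
    """Project s onto the capped simplex {s : sum(s) = 1, s_i <= 1/m}:
    cap any weight above 1/m, rescale the remaining weights to use up
    the leftover mass, and repeat until no weight exceeds the cap."""
    c = 1.0 / m
    capped = [False] * len(s)
    while True:
        over = [i for i, si in enumerate(s) if not capped[i] and si > c + tol]
        if not over:
            return s
        for i in over:
            capped[i] = True
        free = sum(si for i, si in enumerate(s) if not capped[i])
        budget = 1.0 - c * sum(capped)  # mass left for the uncapped weights
        s = [c if capped[i] else si * budget / free
             for i, si in enumerate(s)]

print(cap_and_rescale([0.6, 0.3, 0.1], m=2))
```

The rescaling keeps the relative proportions of the uncapped weights unchanged, which is exactly the "cap and re-scale rest" step above.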
What if environment changes over time
- Initial tube contains 10^20 different strands
- After a while - loss of variety
- Some selfish strand manages to dominate without doing the wanted function

Fixes
- Mix in a little bit of initial soup for enrichment
- Or do sloppy PCR that introduces mutations
Disk spindown problem
- Experts: a set of n fixed timeouts W_1, W_2, ..., W_n
- Master Alg. maintains a set of weights s_1, s_2, ..., s_n
- Multiplicative update: s_{t+1,i} = s_{t,i} e^{-η (energy usage of timeout i)} / Z

Problem
- Burst favors long timeout
- Coefficients of small timeouts wiped out
Disk spindown problem

Fix: mix in a little bit of the uniform vector
- s~ = Multiplicative Update(s)
- s = (1 - α) s~ + α (1/N, 1/N, ..., 1/N), α small
- Nature uses mutation for the same purpose

Better fix: keep track of past average share vector r
- s~ = Multiplicative Update(s)
- s = (1 - α) s~ + α r
- Facilitates switch to previously useful vector - long-term memory
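Both fixes are one line on top of the multiplicative step; a sketch (η, α, and the loss vector below are illustrative values):

```python
import math

def mixing_update(s, losses, eta, alpha, r=None):
    """Multiplicative update followed by mixing: s~_i = s_i e^{-eta loss_i} / Z,
    then s = (1 - alpha) s~ + alpha r.  With r = uniform this lower-bounds
    every weight by alpha/N (mutation); with r = a past average share
    vector it implements the long-term-memory fix."""
    N = len(s)
    tilde = [si * math.exp(-eta * li) for si, li in zip(s, losses)]
    Z = sum(tilde)
    tilde = [ti / Z for ti in tilde]
    if r is None:
        r = [1.0 / N] * N  # default: mix with the uniform vector
    return [(1 - alpha) * ti + alpha * ri for ti, ri in zip(tilde, r)]

print(mixing_update([0.7, 0.2, 0.1], [0.0, 1.0, 5.0], eta=1.0, alpha=0.1))
```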
Question: Long Term Memory - how is it realized in nature? Sex? Junk DNA?

Summary
1. Conservative update - learn multiple goals
2. Upper bounding weights
3. Lower bounding weights - robust against change
Motivations of the updates?
- Motivate additive and multiplicative updates
- Use linear regression as example problem

Deep underlying question: What are updates used by nature and why?
Online linear regression

For t = 1, 2, ...
- Get instance x_t ∈ R^n
- Predict y^_t = s_t · x_t
- Get label y_t ∈ R
- Incur square loss (y_t - y^_t)^2
- Update s_t -> s_{t+1}
[Plot: points (x_t, y_t) with linear fit and prediction y^_t]
Two main update families - linear regression

Additive: s_{t+1} = s_t - η (s_t · x_t - y_t) x_t [gradient of square loss]
- Motivated by squared Euclidean distance
- Weights can go negative
- Gradient Descent (GD) - SKIING

Multiplicative: s_{t+1,i} = s_{t,i} e^{-η (s_t · x_t - y_t) x_{t,i}} / Z_t
- Motivated by relative entropy
- Updated weight vector stays on probability simplex
- Exponentiated Gradient (EG) - LIFE! [KW97]
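The two families side by side on one square-loss example (toy numbers): GD can drive a weight negative, while EG stays on the simplex.

```python
import math

def gd_step(s, x, y, eta):
    """Additive update: s <- s - eta * (s . x - y) * x  (Gradient Descent)."""
    err = sum(si * xi for si, xi in zip(s, x)) - y
    return [si - eta * err * xi for si, xi in zip(s, x)]

def eg_step(s, x, y, eta):
    """Multiplicative update: s_i <- s_i e^{-eta (s . x - y) x_i} / Z
    (Exponentiated Gradient); weights stay positive and sum to 1."""
    err = sum(si * xi for si, xi in zip(s, x)) - y
    w = [si * math.exp(-eta * err * xi) for si, xi in zip(s, x)]
    Z = sum(w)
    return [wi / Z for wi in w]

print(gd_step([0.5, 0.5], [1.0, -1.0], -2.0, 1.0))  # a weight goes negative
print(eg_step([0.5, 0.5], [1.0, -1.0], -2.0, 1.0))  # stays on the simplex
```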
Additive Updates

Goal: Minimize tradeoff between closeness to last share vector and loss on last example

s_{t+1} = argmin_s U(s), where U(s) = ||s - s_t||_2^2 [divergence] + η (s · x_t - y_t)^2 [loss]

η > 0 is the learning rate/speed
Additive updates Therefore, U(s) s i si =s t+1,i = 2(s t+1,i s t,i ) + 2η(s x t y t )x t,i = 0 implicit: s t+1 = s t η(s t+1 x t y t )x t explicit: = s t η(s t x t y t )x t Manfred K. Warmuth (UCSC) The Blessing and the Curse 42 / 51
Multiplicative updates

s_{t+1} = argmin_{Σ_i s_i = 1} U(s), where U(s) = Σ_i s_i ln(s_i / s_{t,i}) [relative entropy] + η (s · x_t - y_t)^2

Define the Lagrangian, with Lagrange coefficient λ:
L(s) = Σ_i s_i ln(s_i / s_{t,i}) + η (s · x_t - y_t)^2 + λ (Σ_i s_i - 1)
Multiplicative updates

∂L(s)/∂s_i = ln(s_i / s_{t,i}) + 1 + η (s · x_t - y_t) x_{t,i} + λ = 0
ln(s_{t+1,i} / s_{t,i}) = -η (s_{t+1} · x_t - y_t) x_{t,i} - λ - 1
s_{t+1,i} = s_{t,i} e^{-η (s_{t+1} · x_t - y_t) x_{t,i}} e^{-λ-1}

Enforce normalization constraint by setting e^{-λ-1} to 1/Z_t

implicit: s_{t+1,i} = s_{t,i} e^{-η (s_{t+1} · x_t - y_t) x_{t,i}} / Z_t
explicit: s_{t+1,i} = s_{t,i} e^{-η (s_t · x_t - y_t) x_{t,i}} / Z_t
Connection to Biology
- One species for each of the n dimensions
- Fitness rate of species i in time interval (t, t+1] clamped to w_i = -η (s_{t+1} · x_t - y_t) x_{t,i}
- Fitness W_i = e^{w_i}
- Algorithm can be seen as running a population
- Can't go to continuous time, since data points arrive at discrete times 1, 2, 3, ...
GD via differential equations

Here f'(x_t) denotes ∂f(x_t)/∂t. The flow: s'_t = -η ∇L(s_t)

Two ways to discretize:

Forward Euler
(s_{t+h} - s_t)/h = -η ∇L(s_t)
s_{t+h} = s_t - η h ∇L(s_t)
explicit (h = 1): s_{t+1} = s_t - η ∇L(s_t)

Backward Euler
(s_{t+h} - s_t)/h = -η ∇L(s_{t+h})
s_{t+h} = s_t - η h ∇L(s_{t+h})
implicit (h = 1): s_{t+1} = s_t - η ∇L(s_{t+1})
EGU via differential equation

The flow: (log s_t)' = -η ∇L(s_t)

Two ways to discretize:

Forward Euler
(log s_{t+h,i} - log s_{t,i})/h = -η ∇L(s_t)_i
s_{t+h,i} = s_{t,i} e^{-η h ∇L(s_t)_i}
explicit (h = 1): s_{t+1,i} = s_{t,i} e^{-η ∇L(s_t)_i}

Backward Euler
(log s_{t+h,i} - log s_{t,i})/h = -η ∇L(s_{t+h})_i
s_{t+h,i} = s_{t,i} e^{-η h ∇L(s_{t+h})_i}
implicit (h = 1): s_{t+1,i} = s_{t,i} e^{-η ∇L(s_{t+1})_i}

Derivation of normalized update more involved
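A sketch of the two explicit (forward Euler) discretizations on a toy quadratic loss L(s) = 1/2 ||s - a||^2 (the loss and all numbers are illustrative): GD moves additively, EGU multiplicatively, so EGU keeps weights positive.

```python
import math

def forward_euler_gd(s, grad, eta, h=1.0):
    """Explicit GD: forward Euler on the flow s' = -eta * grad L(s)."""
    g = grad(s)
    return [si - eta * h * gi for si, gi in zip(s, g)]

def forward_euler_egu(s, grad, eta, h=1.0):
    """Explicit unnormalized EG: forward Euler on (log s)' = -eta * grad L(s)."""
    g = grad(s)
    return [si * math.exp(-eta * h * gi) for si, gi in zip(s, g)]

# toy loss L(s) = 1/2 ||s - a||^2 with gradient s - a
a = [1.0, 0.0]
grad = lambda s: [si - ai for si, ai in zip(s, a)]

gd = egu = [0.5, 0.5]
for _ in range(200):
    gd = forward_euler_gd(gd, grad, eta=0.1)
    egu = forward_euler_egu(egu, grad, eta=0.1)
```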
EG via differential equation

Handle normalization by going one dimension lower:
(ln(s_{t,i} / (1 - Σ_{j=1}^{n-1} s_{t,j})))' = -η (∇L(s_t)_i - ∇L(s_t)_n)

Backward Euler
(ln(s_{t+h,i} / (1 - Σ_{j=1}^{n-1} s_{t+h,j})) - ln(s_{t,i} / (1 - Σ_{j=1}^{n-1} s_{t,j})))/h = -η (∇L(s_{t+h})_i - ∇L(s_{t+h})_n)

implicit: s_{t+1,i} = s_{t,i} e^{-η ∇L(s_{t+1})_i} / (Σ_{j=1}^n s_{t,j} e^{-η ∇L(s_{t+1})_j})
Motivation of Bayes Rule

s_{t+1} = argmin_{Σ_i s_i = 1} U(s), where
U(s) = Σ_i s_i ln(s_i / s_{t,i}) [relative entropy] + Σ_i s_i (-ln P(y_t | y_{t-1}, ..., y_0, i))

Plug in the solution s_{t+1,i} = s_{t,i} P(y_t | y_{t-1}, ..., y_0, i) / P(y_t | y_{t-1}, ..., y_0):

U(s_{t+1}) = -ln Σ_i s_{t,i} P(y_t | y_{t-1}, ..., y_0, i) = -ln P(y_t | y_{t-1}, ..., y_0)
Summary
- Multiplicative updates converge quickly, but wipe out diversity
- Changing conditions require reuse of previously learned knowledge/alternatives
- Diversity is a requirement for success
- Multiplicative updates occur in nature
- Diversity preserved through mutations, linking, super-predator
- Machine learning preserves diversity through conservative update, capping, lower bounding weights, weighted mixtures of the past, ...
Final question?

Does nature use the matrix version of the multiplicative updates?
- Parameter is a density matrix - maintains uncertainty over directions as a symmetric positive matrix
- S := exp(log S - η ∇L(S)) / trace(exp(log S - η ∇L(S)))
- Quantum relative entropy as regularizer