The Blessing and the Curse


The Blessing and the Curse of the Multiplicative Updates
Manfred K. Warmuth, University of California, Santa Cruz
CMPS 272, Feb 31, 2012
Thanks to David Ilstrup and Anindya Sen for helping with the slides

Multiplicative updates
Definition
- The algorithm maintains a weight vector
- Weights are multiplied by non-negative factors
- Motivated by relative entropies as regularizers
Big questions
- Why are multiplicative updates so prevalent in nature?
- Are additive updates used?
- Are matrix versions of multiplicative updates used?

Machine learning vs nature
Typical multiplicative update in ML
- A set of experts predicts something on a trial-by-trial basis
- s_i is the belief in the i-th expert at trial t
- Update [V, LW]: s_i := s_i e^{-η loss_i} / Z, where η is the learning rate and Z the normalizer
- The Bayes update is a special case
Nature
- Weights are concentrations / shares in a population
- Next: an implementation of Bayes in the test tube
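
The update on this slide fits in a few lines of code. A minimal sketch, assuming a NumPy setting; the learning rate eta and the per-trial losses below are illustrative, not from the slides:

```python
import numpy as np

def hedge_update(s, losses, eta=0.5):
    """One multiplicative (Hedge-style) update: s_i := s_i * exp(-eta * loss_i) / Z."""
    s = s * np.exp(-eta * losses)
    return s / s.sum()          # Z is the normalizer

# Toy example: three experts, uniform prior, illustrative losses per trial.
s = np.ones(3) / 3
for losses in [np.array([0.1, 0.9, 0.5]), np.array([0.2, 0.8, 0.4])]:
    s = hedge_update(s, losses)
print(s)   # the weight shifts toward the expert with the smallest cumulative loss
```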

Multiplicative updates
- Blessing: speed
- Curse: loss of variety
We will give three machine learning methods for preventing the curse and discuss related methods used in nature

In-vitro selection algorithm for finding an RNA strand that binds to a protein
Make a tube of random RNA strands and a protein sheet
for t = 1 to 20 do
1. Dip the sheet with protein into the tube
2. Pull it out and wash off the RNA that stuck
3. Multiply the washed-off RNA back to the original amount (normalization)

Modifications
- Start with modifications of a good candidate RNA strand
- Put a similar protein into the solution
- Now attaching must be more specific: selection for a better variant
- Can't be done on a computer, because there are ~10^20 RNA strands per liter of soup

Basic scheme
Start with a unit amount of random RNA
Loop:
1. Functional separation into good RNA and bad RNA
2. Amplify the good RNA back to unit amount

Duplicating DNA with PCR
- Invented by Kary Mullis in 1985
- Short primers hybridize at the ends
- Polymerase runs along the DNA strand and complements the bases
- Ideally, all DNA strands are multiplied by a factor of 2

In-vitro selection algorithm for proteins
- DNA strands are chemically similar
- RNA strands are more varied and can form folded structures (it is conjectured that the first enzymes were made of RNA)
- DNA (4 letters) → RNA (4 letters) → Ribosome → Protein (20 letters)
- Proteins are the materials of life: for any material property, a protein can be found
- How to select for a protein with a certain functionality?

Tag proteins with their DNA

In-vitro selection for proteins
Initialize the tube with tagged random protein strands (the tag of a protein is its RNA coding)
Loop:
1. Functional separation
2. Retrieve the tags from the good proteins
3. Reproduce tagged proteins from the retrieved tags with the ribosome
Almost doable...

The caveat
- Not many interesting/specific functional separations have been found
- Need high throughput for the functional separation

Mathematical description of in-vitro selection
- Name the unknown RNA strands 1, 2, ..., n, where n ≈ 10^20 (# of strands in 1 liter)
- s_i ≥ 0 is the share of RNA i
- The contents of the tube are represented as a share vector s = (s_1, s_2, ..., s_n)
- When the tube holds a unit amount, s_1 + s_2 + ... + s_n = 1
- W_i ∈ [0, 1] is the fraction of one unit of RNA i that is good; W_i is the fitness of RNA strand i for attaching to the protein
- The protein is represented as a fitness vector W = (W_1, W_2, ..., W_n)
- The vectors s, W are unknown: blind computation!
- Strong assumption: the fitness W_i is independent of the share vector s

Update
- Good RNA in tube s: s_1 W_1 + s_2 W_2 + ... + s_n W_n = s · W
- Bad RNA: s_1 (1 - W_1) + s_2 (1 - W_2) + ... + s_n (1 - W_n) = 1 - s · W
- Amplification: the good share of RNA i is s_i W_i, multiplied by a factor F; if precise, then all good RNA is multiplied by the same factor F
- Final tube at the end of the loop: F s · W
- Since the final tube holds a unit amount of RNA, F s · W = 1 and F = 1 / (s · W)
- Update in each loop: s_i := s_i W_i / (s · W)

Implementing Bayes' rule in RNA
- s_i is the prior P(i)
- W_i ∈ [0, 1] is the probability P(Good | i)
- s · W is the probability P(Good)
- The update s_i := s_i W_i / (s · W) is Bayes' rule: posterior P(i | Good) = prior P(i) · P(Good | i) / P(Good)
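
A minimal simulation of this share update, using the initial shares and fitnesses from the next slide (the loop length is illustrative):

```python
import numpy as np

def invitro_update(s, W):
    """One selection round: s_i := s_i * W_i / (s . W), i.e. Bayes' rule with prior s
    and likelihoods P(Good | i) = W_i."""
    return s * W / np.dot(s, W)

s = np.array([0.01, 0.39, 0.60])   # initial shares (prior), as on the next slide
W = np.array([0.90, 0.75, 0.80])   # fitnesses (likelihoods)
for t in range(40):
    s = invitro_update(s, W)
print(s)   # the strand with the largest W_i ends up with almost all of the share
```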

Iterating Bayes' rule with the same data likelihoods
[Plot: shares s_1, s_2, s_3 over trials t, with fitnesses W_1 = .9, W_2 = .75, W_3 = .8]
Initial s = (.01, .39, .6), W = (.9, .75, .8)

max_i W_i eventually wins
[Plot: shares over t for fitnesses such as .9, .85, .84, .7; the strand with the largest W_i takes over]
- The largest share is always increasing, the smallest always decreasing
- Closed form: s_{t,i} = s_{0,i} W_i^t / normalization, where t acts as a speed parameter

Negative range of t reveals sigmoids
[Plot: over the negative and positive range of t, the largest share traces a sigmoid and the smallest a reverse sigmoid; the leading strand takes over]

The best are too good
Multiplicative update: s_i := s_i · (fitness factor)_i, with non-negative factors
- Blessing: the best get amplified exponentially fast
- Curse: the best wipe out the others; loss of variety

A key problem
- 3 strands of RNA: A 15%, B 60%, C 25%
- Want to amplify the mixture while keeping the percentages unchanged
- Iterative amplification will favor one!
- Assumption: W_i independent of the shares

Solution
- Make one long strand with the right percentages of A, B, C (e.g. A B B B C A B B)
- Amplify the same long strand many times, then chop it into its constituents
- The long strand functions as a chromosome
- Free-floating genes in the nucleus would compete: loss of variety

Coupling preserves diversity
- Un-coupled: A, B --PCR--> A, A (A and B compete)
- Coupled: A+B --PCR--> A+B (A and B should cooperate)

Questions
- What updates to s represent an iteration of an in-vitro selection algorithm?
- What updates are possible with blind computation? Must maintain non-negative weights
Assumptions: s independent of W; s · W measurable
So far: s_i := s_i W_i / normalization
More tempered update: s_i := γ s_i (1 - W_i) [BAD] + s_i W_i [GOOD], with 0 ≤ γ < 1,
i.e. s_i := s_i (γ(1 - W_i) + W_i), a non-negative factor
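
A sketch of the tempered update under the same blind-computation assumptions; gamma and the toy vectors below are illustrative, not from the slides:

```python
import numpy as np

def tempered_update(s, W, gamma):
    """Keep a gamma-fraction of the bad RNA: s_i := s_i * (gamma*(1 - W_i) + W_i), then renormalize."""
    assert 0 <= gamma < 1
    s = s * (gamma * (1.0 - W) + W)
    return s / s.sum()

s = np.array([0.2, 0.3, 0.5])
W = np.array([0.9, 0.5, 0.1])
print(tempered_update(s, W, gamma=0.5))   # gamma = 0 recovers s_i W_i / (s . W)
```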

More challenging goal
Find a set of RNA strands that binds to a number of different proteins
Proteins P_1, ..., P_5 with fitness vectors W_1, ..., W_5, presented at trials t = 1, ..., 5
- In trial t, the functional separation is based on the fitness vector W_t
- So far, all W_t were the same...

In-vitro selection algorithm
           P_1    P_2    P_3    P_4    P_5
W_{t,1}    0.9    0.8    0.96   0.2    0.04
W_{t,2}    0.1    0.01   0.8    0.9    0.8
u · W_t    0.5    0.405  0.88   0.55   0.42
Goal tube: u = (0.5, 0.5, 0, ..., 0)
- The two strands cooperatively solve the problem
- Related to a disjunction of two variables

Key problem 2
Goal: starting with a 1-liter tube in which all strands appear with frequency 10^-20, use PCR and ..., to arrive at the tube (0.5, 0.5, 0, ..., 0)
Problem
- Over-train with P_1 and P_2: s_1 → 1, s_2 → 0
- Over-train with P_4 and P_5: s_1 → 0, s_2 → 1
We want blind computation! W_t and the initial s are unknown; we need some kind of feedback in each trial

Related machine learning problem
- Tube: share vector s in the probability simplex
- Proteins: example vectors W_t ∈ {0, 1}^N
- Assumption: there is a u = (0, 0, 1/k, 0, 0, 1/k, 0, 0, 1/k, 0, 0, 0) with k non-zero components of value 1/k such that ∀t: u · W_t ≥ 1/(2k)
- Goal: find s such that ∀t: s · W_t ≥ 1/(2k)

Normalized Winnow Algorithm (N. Littlestone)
Do passes over the examples W_t:
- if s_{t-1} · W_t ≥ 1/(2k) then s_t := s_{t-1} (conservative update)
- else s_{t,i} := s_{t-1,i} if W_{t,i} = 0, and s_{t,i} := α s_{t-1,i} if W_{t,i} = 1, where α > 1; then re-normalize
O(k log(N/k)) updates are needed when the labels are consistent with a k-literal disjunction
- Logarithmic in N
- Optimal to within constant factors
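
A minimal sketch of this algorithm. The synthetic data generator below is an assumption for illustration: 0/1 example vectors that all turn on at least one of the first k (relevant) variables, plus random noise.

```python
import numpy as np

def normalized_winnow(examples, k, alpha=2.0):
    """Passes over 0/1 example vectors W_t; conservative update when s.W_t >= 1/(2k),
    otherwise promote the active components by a factor alpha and renormalize."""
    N = examples.shape[1]
    s = np.ones(N) / N
    updates, changed = 0, True
    while changed:
        changed = False
        for W in examples:
            if np.dot(s, W) >= 1.0 / (2 * k):
                continue                        # conservative update: keep s
            s = np.where(W == 1, alpha * s, s)  # promote active components
            s /= s.sum()
            updates += 1
            changed = True
    return s, updates

rng = np.random.default_rng(0)
N, k, T = 64, 3, 200
examples = (rng.random((T, N)) < 0.1).astype(int)
examples[np.arange(T), rng.integers(0, k, T)] = 1   # satisfy the k-literal disjunction
s, updates = normalized_winnow(examples, k)
print(updates, round(s[:k].sum(), 2))  # few updates; typically most weight on the k relevant variables
```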

Back to in-vitro selection
Fix: if the good side is large enough, then don't update
- if s_{t-1} · W_t ≥ δ then s_t := s_{t-1} (conservative update)
- else s_{t,i} := s_{t-1,i} ((1 - W_{t,i}) γ_t + W_{t,i}), a non-negative factor with 0 ≤ γ_t < 1, and re-normalize
Can simulate Normalized Winnow
Concerns
- How exactly can we measure s · W?
- Some RNA might be inactive!

Problems with in-vitro selection
Story so far: for learning multiple goals, prevent overtraining with the conservative update
Alternate trick: cap the weights
- Start with nature
- How we used the same trick in machine learning

Alternate trick: cap weights
Super-predator algorithm: preserves variety
(© Lion image: copyrights belong to Andy Warhol)

Alternate trick: cap weights
[Bar chart: weights of six experts before and after capping]
- s_i = s_i e^{-η Loss_i} / Z
- s' = capped and rescaled version of s: cap the large weights and re-scale the rest

Why capping?
- Sets of size m are encoded as probability vectors (0, 1/m, 0, 0, 1/m, 0, 1/m), called m-corners; there are (n choose m) of them
- The convex hull of the m-corners = the capped probability simplex, e.g. the corners (½, 0, ½), (0, ½, ½), (½, ½, 0) for m = 2, n = 3
- The combined update converges to the best corner of the capped simplex, i.e. the best set of size m
- Capping lets us learn multiple goals
- How to cap in the in-vitro selection setting?
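
One way to realize the "cap and re-scale the rest" step from the previous slide: clip any weight above the cap 1/m and rescale the remaining weights to absorb the surplus, repeating until nothing exceeds the cap. This is a sketch written from the slides' description, not necessarily the exact procedure in the paper:

```python
import numpy as np

def cap_and_rescale(s, m, tol=1e-12):
    """Return a probability vector with every component <= 1/m, obtained by
    repeatedly capping the large components and rescaling the rest."""
    s = s / s.sum()
    cap = 1.0 / m
    capped = np.zeros(len(s), dtype=bool)
    while s.max() > cap + tol:
        capped |= s >= cap
        free = ~capped
        mass_left = 1.0 - cap * capped.sum()    # probability mass left for the uncapped part
        s = np.where(capped, cap, s)
        s[free] *= mass_left / s[free].sum()    # re-scale the rest
    return s

s = np.array([0.70, 0.15, 0.10, 0.05])
print(cap_and_rescale(s, m=2))   # -> [0.5, 0.25, 0.1667, 0.0833], capped at 1/2
```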

What if the environment changes over time?
- The initial tube contains 10^20 different strands
- After a while: loss of variety; some selfish strand manages to dominate without doing the wanted function
Fixes
- Mix in a little bit of the initial soup for enrichment
- Or do sloppy PCR that introduces mutations

Disk spindown problem
- Experts: a set of n fixed timeouts W_1, W_2, ..., W_n
- The master algorithm maintains a set of weights s_1, s_2, ..., s_n
- Multiplicative update: s_{t+1,i} = s_{t,i} e^{-η (energy usage of timeout i)} / Z
Problem
- A burst favors the long timeouts
- The coefficients of the small timeouts get wiped out

Disk spindown problem
Fix: mix in a little bit of the uniform vector
- s' = Multiplicative Update(s); s := (1 - α) s' + α (1/N, 1/N, ..., 1/N), where α is small
- Nature uses mutation for the same purpose
Better fix: keep track of the past average share vector r
- s' = Multiplicative Update(s); s := (1 - α) s' + α r
- Facilitates switching back to a previously useful vector: long-term memory
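
A sketch of this mixing fix; the losses ("energy usage") and α below are illustrative:

```python
import numpy as np

def mixed_update(s, losses, r, eta=0.5, alpha=0.05):
    """Multiplicative update followed by mixing: s := (1 - alpha)*s + alpha*r.
    Use r = uniform for the simple fix, or r = running average of past share
    vectors for the long-term-memory fix."""
    s = s * np.exp(-eta * losses)
    s /= s.sum()
    return (1 - alpha) * s + alpha * r

N = 4
uniform = np.ones(N) / N
s = uniform.copy()
for losses in np.random.default_rng(1).random((50, N)):
    s = mixed_update(s, losses, r=uniform)   # or r = running average of past s
print(s.min() >= 0.05 / N)   # True: mixing keeps every weight >= alpha/N
```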

Question: long-term memory
- How is it realized in nature? Sex? Junk DNA?
Summary
1. Conservative update - learn multiple goals
2. Upper bounding weights (capping)
3. Lower bounding weights - robust against change

Motivations of the updates
- Motivate additive and multiplicative updates
- Use linear regression as the example problem
Deep underlying question: what are the updates used by nature, and why?

Online linear regression
For t = 1, 2, ...
- Get instance x_t ∈ R^n
- Predict ŷ_t = s_t · x_t
- Get label y_t ∈ R
- Incur square loss (y_t - ŷ_t)^2
- Update s_t → s_{t+1}
[Figure: a linear fit with the point (x_t, y_t) and the prediction ŷ_t]

Two main update families - linear regression
Additive: s_{t+1} = s_t - η (s_t · x_t - y_t) x_t   [gradient of the square loss]
- Motivated by the squared Euclidean distance
- Weights can go negative
- Gradient Descent (GD) - SKIING
Multiplicative: s_{t+1,i} = s_{t,i} e^{-η (s_t · x_t - y_t) x_{t,i}} / Z_t
- Motivated by the relative entropy
- The updated weight vector stays on the probability simplex
- Exponentiated Gradient (EG) - LIFE! [KW97]
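
A side-by-side sketch of the two updates on a toy online linear regression problem; the data, comparator u, and learning rates are synthetic and illustrative:

```python
import numpy as np

def gd_update(s, x, y, eta=0.1):
    """Additive / Gradient Descent step for the square loss."""
    return s - eta * (np.dot(s, x) - y) * x

def eg_update(s, x, y, eta=0.1):
    """Multiplicative / Exponentiated Gradient step; s stays on the probability simplex."""
    s = s * np.exp(-eta * (np.dot(s, x) - y) * x)
    return s / s.sum()

rng = np.random.default_rng(0)
n, T = 5, 500
u = np.array([0.6, 0.4, 0.0, 0.0, 0.0])       # sparse comparator on the simplex
s_gd, s_eg = np.zeros(n), np.ones(n) / n
for _ in range(T):
    x = rng.normal(size=n)
    y = np.dot(u, x)
    s_gd = gd_update(s_gd, x, y)
    s_eg = eg_update(s_eg, x, y)
print(np.round(s_gd, 2), np.round(s_eg, 2))   # both approach u; EG keeps its weights non-negative
```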

Additive updates
Goal: minimize a tradeoff between closeness to the last share vector and the loss on the last example
s_{t+1} = argmin_s U(s), where U(s) = ||s - s_t||_2^2 [divergence] + η (s · x_t - y_t)^2 [loss]
η > 0 is the learning rate / speed

Additive updates
Therefore, ∂U(s)/∂s_i at s = s_{t+1}: 2(s_{t+1,i} - s_{t,i}) + 2η(s · x_t - y_t) x_{t,i} = 0
implicit: s_{t+1} = s_t - η (s_{t+1} · x_t - y_t) x_t
explicit: s_{t+1} = s_t - η (s_t · x_t - y_t) x_t

Multiplicative updates
s_{t+1} = argmin_{s: Σ_i s_i = 1} U(s), where U(s) = Σ_i s_i ln(s_i / s_{t,i}) [relative entropy] + η (s · x_t - y_t)^2
Define the Lagrangian L(s) = Σ_i s_i ln(s_i / s_{t,i}) + η (s · x_t - y_t)^2 + λ(Σ_i s_i - 1), where λ is the Lagrange coefficient

Multiplicative updates
∂L(s)/∂s_i = ln(s_i / s_{t,i}) + 1 + η(s · x_t - y_t) x_{t,i} + λ = 0
ln(s_{t+1,i} / s_{t,i}) = -η(s_{t+1} · x_t - y_t) x_{t,i} - λ - 1
s_{t+1,i} = s_{t,i} e^{-η(s_{t+1} · x_t - y_t) x_{t,i}} e^{-λ-1}
Enforce the normalization constraint by setting e^{-λ-1} to 1/Z_t
implicit: s_{t+1,i} = s_{t,i} e^{-η(s_{t+1} · x_t - y_t) x_{t,i}} / Z_t
explicit: s_{t+1,i} = s_{t,i} e^{-η(s_t · x_t - y_t) x_{t,i}} / Z_t

Connection to biology
- One species for each of the n dimensions
- The fitness rate of species i in the time interval (t, t+1] is clamped to w_i = -η(s_{t+1} · x_t - y_t) x_{t,i}
- Fitness W_i = e^{w_i}
- The algorithm can be seen as running a population
- Can't go to continuous time, since the data points arrive at discrete times 1, 2, 3, ...

GD via differential equations
Here ḟ(x_t) = ∂f(x_t)/∂t. The continuous-time flow is ṡ_t = -η ∇L(s_t). Two ways to discretize:
Forward Euler:
(s_{t+h} - s_t)/h = -η ∇L(s_t), so s_{t+h} = s_t - ηh ∇L(s_t)
explicit (h = 1): s_{t+1} = s_t - η ∇L(s_t)
Backward Euler:
(s_{t+h} - s_t)/h = -η ∇L(s_{t+h}), so s_{t+h} = s_t - ηh ∇L(s_{t+h})
implicit (h = 1): s_{t+1} = s_t - η ∇L(s_{t+1})
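
For the square loss, the implicit (Backward Euler) step can be solved in closed form, which makes the difference between the two discretizations easy to see. A small sketch with synthetic numbers, not from the slides:

```python
import numpy as np

def gd_explicit(s, x, y, eta):
    """Forward Euler: s_{t+1} = s_t - eta*(s_t . x - y)*x."""
    return s - eta * (np.dot(s, x) - y) * x

def gd_implicit(s, x, y, eta):
    """Backward Euler: s_{t+1} = s_t - eta*(s_{t+1} . x - y)*x.
    Dotting both sides with x gives a scalar equation for a = s_{t+1} . x,
    so the fixed point has a closed form for the square loss."""
    a = (np.dot(s, x) + eta * y * np.dot(x, x)) / (1 + eta * np.dot(x, x))
    return s - eta * (a - y) * x

s = np.array([0.2, 0.8])
x, y, eta = np.array([1.0, 2.0]), 3.0, 10.0   # deliberately large eta
print(gd_explicit(s, x, y, eta))  # overshoots badly for large eta
print(gd_implicit(s, x, y, eta))  # stays stable: the prediction moves toward y without overshooting
```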

EGU via differential equation
The continuous-time flow is (d/dt) log s_t = -η ∇L(s_t). Two ways to discretize:
Forward Euler:
(log s_{t+h,i} - log s_{t,i})/h = -η ∇L(s_t)_i, so s_{t+h,i} = s_{t,i} e^{-ηh ∇L(s_t)_i}
explicit (h = 1): s_{t+1,i} = s_{t,i} e^{-η ∇L(s_t)_i}
Backward Euler:
(log s_{t+h,i} - log s_{t,i})/h = -η ∇L(s_{t+h})_i, so s_{t+h,i} = s_{t,i} e^{-ηh ∇L(s_{t+h})_i}
implicit (h = 1): s_{t+1,i} = s_{t,i} e^{-η ∇L(s_{t+1})_i}
The derivation of the normalized update is more involved

EG via differential equation
Handle the normalization by going one dimension lower, with s_n = 1 - Σ_{j=1}^{n-1} s_j:
ln( s_{t+1,i} / (1 - Σ_{j=1}^{n-1} s_{t+1,j}) ) - ln( s_{t,i} / (1 - Σ_{j=1}^{n-1} s_{t,j}) ) = -η( ∇L(s_t)_i - ∇L(s_t)_n )   (Forward Euler)
                                                                                           = -η( ∇L(s_{t+1})_i - ∇L(s_{t+1})_n )   (Backward Euler)
implicit: s_{t+1,i} = s_{t,i} e^{-η ∇L(s_{t+1})_i} / Σ_{j=1}^n s_{t,j} e^{-η ∇L(s_{t+1})_j}

Motivation of Bayes' rule
s_{t+1} = argmin_{s: Σ_i s_i = 1} U(s), where
U(s) = Σ_i s_i ln(s_i / s_{t,i}) [relative entropy] + Σ_i s_i (-ln P(y_t | y_{t-1}, ..., y_0, i))
Plugging in the solution
s_{t+1,i} = s_{t,i} P(y_t | y_{t-1}, ..., y_0, i) / P(y_t | y_{t-1}, ..., y_0)
gives
U(s_{t+1}) = -ln Σ_i s_{t,i} P(y_t | y_{t-1}, ..., y_0, i) = -ln P(y_t | y_{t-1}, ..., y_0)
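
The "plugging in the solution" step hides a short minimization; a sketch of that step (a standard Lagrangian calculation, filling in what the slide skips):

```latex
\begin{align*}
  &\text{Minimize } U(s)=\sum_i s_i\ln\frac{s_i}{s_{t,i}}
    -\sum_i s_i\ln P(y_t\mid y_{t-1},\dots,y_0,i)
    \quad\text{s.t. } \textstyle\sum_i s_i = 1.\\
  &\partial_{s_i}\Big[U(s)+\lambda\big(\textstyle\sum_j s_j-1\big)\Big]
    = \ln\frac{s_i}{s_{t,i}} + 1 - \ln P(y_t\mid y_{t-1},\dots,y_0,i) + \lambda = 0\\
  &\;\Rightarrow\; s_{t+1,i}\;\propto\; s_{t,i}\,P(y_t\mid y_{t-1},\dots,y_0,i)
   \;\Rightarrow\; s_{t+1,i}
    = \frac{s_{t,i}\,P(y_t\mid y_{t-1},\dots,y_0,i)}{P(y_t\mid y_{t-1},\dots,y_0)},\\
  &\text{since } \textstyle\sum_i s_{t,i}\,P(y_t\mid y_{t-1},\dots,y_0,i)
    = P(y_t\mid y_{t-1},\dots,y_0).
\end{align*}
```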

Summary
- Multiplicative updates converge quickly, but wipe out diversity
- Changing conditions require the reuse of previously learned knowledge/alternatives
- Diversity is a requirement for success
- Multiplicative updates occur in nature; diversity is preserved through mutations, linking, and super-predators
- Machine learning preserves diversity through the conservative update, capping, lower-bounding weights, weighted mixtures of the past, ...

Final question
Does nature use the matrix version of the multiplicative updates?
- The parameter is a density matrix: it maintains uncertainty over directions as a symmetric positive definite matrix S
- Update: S := exp(log S - η ∇L(S)) / trace(exp(log S - η ∇L(S)))
- Quantum relative entropy as the regularizer
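
A sketch of this density-matrix update, computing the matrix log/exp via an eigendecomposition. The loss here is an illustrative linear loss trace(S X) with a made-up symmetric X, not a loss from the talk:

```python
import numpy as np

def sym_fn(S, fn):
    """Apply a scalar function to a symmetric matrix via its eigendecomposition."""
    vals, vecs = np.linalg.eigh(S)
    return (vecs * fn(vals)) @ vecs.T

def matrix_eg_update(S, grad, eta=0.5):
    """Density-matrix update: S := exp(log S - eta*grad) / trace(exp(log S - eta*grad))."""
    M = sym_fn(S, np.log) - eta * grad
    S_new = sym_fn(M, np.exp)
    return S_new / np.trace(S_new)

n = 3
S = np.eye(n) / n                                 # maximally uncertain density matrix
rng = np.random.default_rng(0)
A = rng.normal(size=(n, n)); X = (A + A.T) / 2    # symmetric "loss direction"
for _ in range(50):
    S = matrix_eg_update(S, grad=X)               # gradient of the linear loss trace(S X) is X
print(np.round(np.linalg.eigvalsh(S), 3))  # mass concentrates on the smallest-eigenvalue direction of X
```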