Lecture 16: Perceptron and Exponential Weights Algorithm


EECS 598-005: Theoretical Foundations of Machine Learning, Fall 2015
Lecture 16: Perceptron and Exponential Weights Algorithm
Lecturer: Jacob Abernethy. Scribe: Yue Wang. Editors: Weiqing Yu and Andrew Melfi.

16.1 Review: The Halving Algorithm

16.1.1 Problem Setting

Last lecture we started our discussion of online learning, and more specifically, prediction with expert advice. In this problem setting, time proceeds as a sequence of rounds, $t = 1, \dots, T$. At each round $t$, the environment reveals an outcome $y_t \in \{-1, 1\}$. We also have a pool of experts, $i = 1, \dots, N$. At each round $t$, the $i$-th expert makes a prediction on the outcome, $f_i^t \in \{-1, 1\}$. Our goal is to design an algorithm $\mathcal{A}$ (also called a predictor, forecaster, or master algorithm) that combines the experts' advice to predict the outcome with as few mistakes as possible. It can be described as the following process.

Initialize the counter $\#\text{mistakes}[\mathcal{A}] = 0$. For each round $t = 1, \dots, T$:
(1) $\mathcal{A}$ observes the predictions made by the $N$ experts, $f_i^t \in \{-1, 1\}$, $i = 1, \dots, N$;
(2) $\mathcal{A}$ makes a prediction $\hat{y}_t \in \{-1, 1\}$;
(3) the environment reveals the outcome $y_t \in \{-1, 1\}$;
(4) if $\hat{y}_t \neq y_t$, increment $\#\text{mistakes}[\mathcal{A}]$.

Here, the environment (or "nature") is a mechanism that generates the sequence of outcomes. In the online learning setting, this mechanism can be deterministic, stochastic, or even adversarial (adaptive to $\mathcal{A}$'s behavior). This is much more general than the batch learning setting, where the mechanism is assumed to be stochastic.

16.1.2 The Halving Algorithm

The Halving Algorithm takes the majority vote of the experts that have been consistent up to the current round. Formally, denote the set of consistent experts before round $t$ by

$$C_t = \{\, i : f_i^\tau = y_\tau, \ \forall \tau = 1, \dots, t-1 \,\}.$$

On round $t$, the Halving Algorithm makes its prediction as

$$\hat{y}_t = \begin{cases} 1, & \text{if } |\{i \in C_t : f_i^t = 1\}| \ge \frac{|C_t|}{2}; \\ -1, & \text{otherwise.} \end{cases}$$
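To make the protocol concrete, here is a minimal Python sketch of the Halving Algorithm. The input format (`expert_preds`, `outcomes`) is a hypothetical choice for illustration, not something fixed by the lecture.

```python
# A minimal sketch of the Halving Algorithm (hypothetical input format).
# expert_preds[t][i] is expert i's prediction f_i^t in {-1, +1};
# outcomes[t] is the true outcome y_t in {-1, +1}.

def halving(expert_preds, outcomes):
    n_experts = len(expert_preds[0])
    consistent = set(range(n_experts))  # C_t: experts with no mistakes so far
    mistakes = 0
    for preds, y in zip(expert_preds, outcomes):
        votes_for_one = sum(1 for i in consistent if preds[i] == 1)
        # Majority vote of the consistent experts (ties go to +1).
        y_hat = 1 if votes_for_one >= len(consistent) / 2 else -1
        if y_hat != y:
            mistakes += 1
        # Drop every expert that erred this round.
        consistent = {i for i in consistent if preds[i] == y}
    return mistakes
```

If a perfect expert exists, `consistent` never becomes empty, and the theorem below bounds the returned mistake count.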

Theorem 16.1 (Upper bound on $\#\text{mistakes}[\text{Halving}]$). If there exists a perfect expert $i^*$ such that $f_{i^*}^t = y_t$ for all $t = 1, \dots, T$, then

$$\#\text{mistakes}[\text{Halving}] \le \log_2 N.$$

Proof: We have $|C_1| = N$. By the assumption that there is always at least one perfect expert, $|C_{T+1}| \ge 1$. By the nature of the Halving Algorithm, if $\hat{y}_t \neq y_t$, then $|C_{t+1}| \le \frac{1}{2}|C_t|$. Together we have

$$1 \le |C_{T+1}| \le |C_1| \left(\tfrac{1}{2}\right)^{\#\text{mistakes}[\text{Halving}]} = N \left(\tfrac{1}{2}\right)^{\#\text{mistakes}[\text{Halving}]}.$$

This implies $\#\text{mistakes}[\text{Halving}] \le \log_2 N$.

Note that the upper bound is also tight. To see this, imagine that the outcomes are generated by an adversary who always wants to inflict a mistake on $\mathcal{A}$: when $\hat{y}_t = 1$, the adversary plays $y_t = -1$, and vice versa. Suppose we have $N = 2^n$ experts, only one of whom is perfect. At each round, half of the consistent experts predict $1$ and the other half predict $-1$, which maximally slows down the shrinking rate of $C_t$. Then the Halving Algorithm makes exactly $\log_2 N$ mistakes before it discovers the perfect expert, and makes no more mistakes from then on.

The Halving Algorithm works well if there exists a perfect expert, which we call a noise-free setting. (The corresponding scenario in batch learning is to have a finite hypothesis class that contains the target hypothesis.) What if the best expert makes a few mistakes? This is a noisy setting, and we need a better algorithm than majority vote. The idea is to maintain weights on the experts, and to reduce an expert's weight when she makes mistakes. But before we move on to the noisy setting, let us study another classical algorithm for the noise-free setting: the Perceptron algorithm for online linear prediction.

16.2 Online Linear Prediction and Perceptron

16.2.1 Problem Setting

As before, we play in rounds $t = 1, \dots, T$. At each round $t$, the environment reveals an outcome $y_t \in \{-1, 1\}$. In online linear prediction, the algorithm $\mathcal{A}$ observes a feature vector $x_t \in \mathbb{R}^d$ before it makes its prediction, and its prediction has the form $\hat{y}_t = \operatorname{sign}(w \cdot x_t)$ for some $w \in \mathbb{R}^d$, where $\operatorname{sign}(x) = 1$ if $x \ge 0$ and $-1$ otherwise. The prediction protocol can be described as the following process.

Initialize the counter $\#\text{mistakes}[\mathcal{A}] = 0$. For each round $t = 1, \dots, T$:
(1) $\mathcal{A}$ selects $w_t \in \mathbb{R}^d$;
(2) $\mathcal{A}$ observes $x_t \in \mathbb{R}^d$ with $\|x_t\| \le 1$;
(3) $\mathcal{A}$ predicts $\hat{y}_t = \operatorname{sign}(w_t \cdot x_t)$;
(4) the environment reveals the outcome $y_t \in \{-1, 1\}$;
(5) if $\hat{y}_t \neq y_t$, increment $\#\text{mistakes}[\mathcal{A}]$.

Essentially, $\mathcal{A}$'s prediction $\hat{y}_t$ is determined by the hypothesis (weight vector) $w_t$.

16.2.2 The Perceptron Algorithm

The Perceptron Algorithm initially sets $w_1 = 0$; at each round $t = 1, \dots, T$, it updates

$$w_{t+1} = \begin{cases} w_t, & \text{if } y_t (w_t \cdot x_t) > 0; \\ w_t + y_t x_t, & \text{otherwise.} \end{cases}$$
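The update rule translates directly into code. Below is a minimal NumPy sketch of the online Perceptron protocol; the input `stream` of $(x_t, y_t)$ pairs is a hypothetical interface chosen for illustration.

```python
import numpy as np

# A minimal sketch of the online Perceptron (hypothetical input stream).
# Each x_t is a d-dimensional NumPy vector with ||x_t|| <= 1;
# each y_t is in {-1, +1}.

def perceptron(stream, d):
    w = np.zeros(d)                        # w_1 = 0
    mistakes = 0
    for x, y in stream:
        y_hat = 1 if w @ x >= 0 else -1    # sign convention: sign(0) = 1
        if y_hat != y:
            mistakes += 1
        if y * (w @ x) <= 0:               # non-positive margin: update
            w = w + y * x
    return w, mistakes
```

If the stream is linearly separable with margin as in Theorem 16.2 below, the returned mistake count never exceeds $\|w^*\|^2$, regardless of the order in which the examples arrive.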

Theorem 16.2 (Upper bound on $\#\text{mistakes}[\text{Perceptron}]$). Suppose there exists a perfect hypothesis $w^* \in \mathbb{R}^d$ such that $y_t (w^* \cdot x_t) \ge 1$ for all $t = 1, \dots, T$. Then

$$\#\text{mistakes}[\text{Perceptron}] \le \|w^*\|^2.$$

Proof: The key (trick) is to define a potential function and look at its change over time. Let

$$\Phi_t := \|w^* - w_{t+1}\|^2.$$

Since $w_1 = 0$ and $\Phi_T = \|w^* - w_{T+1}\|^2 \ge 0$,

$$\|w^*\|^2 = \|w^* - w_1\|^2 = \Phi_0 \ge \Phi_0 - \Phi_T = \sum_{t=1}^{T} (\Phi_{t-1} - \Phi_t).$$

On rounds with $y_t (w_t \cdot x_t) > 0$ we have $w_{t+1} = w_t$, so $\Phi_{t-1} - \Phi_t = 0$. On the remaining rounds, which include every mistake round,

$$\Phi_{t-1} - \Phi_t = \|w^* - w_t\|^2 - \|w^* - (w_t + y_t x_t)\|^2 = 2\big[\underbrace{y_t (w^* \cdot x_t)}_{\ge 1} - \underbrace{y_t (w_t \cdot x_t)}_{\le 0}\big] - \underbrace{y_t^2 \|x_t\|^2}_{\le 1} \ge 1.$$

Therefore

$$\|w^*\|^2 \ge \sum_{t :\, y_t (w_t \cdot x_t) \le 0} 1 \ge \#\text{mistakes}[\text{Perceptron}].$$

Additional notes:
(1) Since a perfect hypothesis $w^*$ satisfies $y_t (w^* \cdot x_t) > 0$ for all $t = 1, \dots, T$, we can always scale it so that $y_t (w^* \cdot x_t) \ge 1$, as stated in Theorem 16.2.
(2) In this setting, the hypothesis $w^*$ defines a hyperplane that separates the feature vectors into two classes based on their outcomes. We refer to $\gamma = \frac{1}{\|w^*\|}$ as the margin, because the distance between any feature vector and the hyperplane is at least $\gamma$.
(3) Recall that we have a similar upper bound in the support vector machine (SVM) batch learning setting, where the training examples are assumed i.i.d. Here, we make no i.i.d. assumption on the sequence $((x_1, y_1), \dots, (x_T, y_T))$; the upper bound holds for any sequence.

16.3 The Exponential Weights Algorithm

The drawback of the Halving Algorithm is the strong assumption that a perfect expert exists. How can we remove this assumption? What if even the best expert makes mistakes? The answer is to weight the experts based on their past performance. This leads to the Exponential Weights Algorithm (EWA) below.

16.3.1 Problem Setting

Consider the same problem setting as in Section 16.1.1, except that we now assume the master algorithm $\mathcal{A}$ can make mixed predictions $\hat{y}_t \in [-1, 1]$. That is, $\hat{y}_t$ takes the form

$$\hat{y}_t = \frac{\sum_{i=1}^{N} w_i^t f_i^t}{\sum_{i=1}^{N} w_i^t}.$$

Essentially, $\mathcal{A}$'s prediction $\hat{y}_t$ is determined by the weights $w^t = (w_1^t, \dots, w_N^t) \in \mathbb{R}^N$. In addition, assume that we are given a loss function $\ell(\hat{y}, y)$ that is convex in $\hat{y}$. Instead of the absolute number of mistakes, we measure the performance of algorithm $\mathcal{A}$ by the cumulative loss it suffers after $T$ rounds relative to the best expert:

$$R_T = \sum_{t=1}^{T} \ell(\hat{y}_t, y_t) - \min_{1 \le i \le N} \sum_{t=1}^{T} \ell(f_i^t, y_t).$$

$R_T$ is called the regret. Our goal is to design a master algorithm $\mathcal{A}$ with small regret.

16.3.2 The Exponential Weights Algorithm

To ease notation, let us denote the cumulative loss of expert $i$ up to round $t$ by $L_i^t = \sum_{s=1}^{t} \ell(f_i^s, y_s)$, and the cumulative loss of algorithm $\mathcal{A}$ up to round $t$ by $L_\mathcal{A}^t = \sum_{s=1}^{t} \ell(\hat{y}_s, y_s)$. The regret of $\mathcal{A}$ after $T$ rounds is then $L_\mathcal{A}^T - \min_i L_i^T$.

The Exponential Weights Algorithm (EWA) initially sets $w_i^1 = 1$ for all $i$; at each round $t = 1, \dots, T$, it sets

$$w_i^{t+1} = \exp(-\eta L_i^t),$$

where $\eta > 0$ is a constant (hyperparameter). Intuitively, EWA says that the (unnormalized) weight of expert $i$ decays exponentially with her cumulative loss.

Theorem 16.3 (Upper bound on $L_{EWA}^T$). Assume the loss function takes values in $[0, 1]$. Then the $T$-round cumulative loss of the Exponential Weights Algorithm satisfies, for all $i$,

$$L_{EWA}^T \le \frac{\eta L_i^T + \log N}{1 - e^{-\eta}}.$$

Corollary 16.4 (Regret upper bound after tuning $\eta$). For some optimal $\eta > 0$,

$$L_{EWA}^T \le L_{i^*}^T + \log N + \sqrt{2 L_{i^*}^T \log N},$$

where $i^* = \arg\min_{1 \le i \le N} L_i^T$ is the best expert in hindsight.

Lemma 16.5. Let $X$ be a random variable in $[0, 1]$. For all $s \in \mathbb{R}$,

$$\log \mathbb{E}\left[e^{sX}\right] \le (e^s - 1)\, \mathbb{E}[X].$$

Proof: First, for $x \in [0, 1]$,

$$e^{sx} \le 1 + (e^s - 1)\, x.$$

This can be seen by drawing the graph of the convex function $f(x) = e^{sx}$ and the line segment $g(x) = 1 + (e^s - 1)x = f(0) + \frac{f(1) - f(0)}{1 - 0}\, x$ on the interval $x \in [0, 1]$. In fact, $g(x)$ is the chord of $f(x)$ connecting $(0, f(0))$ and $(1, f(1))$; since $f(x)$ is convex, its graph lies below this chord on $[0, 1]$. This holds for all $s \in \mathbb{R}$. Taking expectations on both sides when $X$ is a random variable in $[0, 1]$, we have

$$\mathbb{E}\left[e^{sX}\right] \le 1 + (e^s - 1)\, \mathbb{E}[X].$$

Since $\log(\cdot)$ is monotonic and both sides are strictly positive, taking logarithms preserves the inequality:

$$\log \mathbb{E}\left[e^{sX}\right] \le \log\left(1 + (e^s - 1)\, \mathbb{E}[X]\right).$$

Finally, since $\log(1 + x) \le x$ for all $x > -1$, we have

$$\log \mathbb{E}\left[e^{sX}\right] \le (e^s - 1)\, \mathbb{E}[X].$$
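Before proving Theorem 16.3, here is a minimal Python sketch of EWA. The scaled absolute loss $\ell(\hat{y}, y) = |\hat{y} - y|/2$ and the default value of `eta` are hypothetical choices, used only because they satisfy the convexity and $[0, 1]$-range assumptions above.

```python
import numpy as np

# A minimal sketch of the Exponential Weights Algorithm (hypothetical
# input format). expert_preds[t][i] = f_i^t in [-1, 1]; outcomes[t] = y_t.

def loss(y_hat, y):
    # Scaled absolute loss: convex in y_hat and valued in [0, 1].
    return abs(y_hat - y) / 2.0

def exponential_weights(expert_preds, outcomes, eta=0.5):
    expert_preds = np.asarray(expert_preds, dtype=float)
    n_rounds, n_experts = expert_preds.shape
    cum_loss = np.zeros(n_experts)      # L_i^{t-1} for each expert
    total_loss = 0.0                    # L_EWA^T accumulates here
    for t in range(n_rounds):
        w = np.exp(-eta * cum_loss)     # unnormalized weights w_i^t
        y_hat = (w @ expert_preds[t]) / w.sum()   # mixed prediction in [-1, 1]
        total_loss += loss(y_hat, outcomes[t])
        cum_loss += np.array([loss(f, outcomes[t]) for f in expert_preds[t]])
    return total_loss, cum_loss
```

(For very long horizons one would subtract `cum_loss.min()` before exponentiating, to avoid numerical underflow; the prediction is unchanged since the weights are normalized.)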

Proof (Theorem 16.3): Let $W_t = \sum_{i=1}^{N} w_i^t$ and define the potential function $\Phi_t = \log W_t$. Observe the change of the potential function over time:

$$\Phi_{t+1} - \Phi_t = \log \frac{W_{t+1}}{W_t} = \log \sum_{i=1}^{N} \frac{w_i^t}{W_t} \exp\left(-\eta\, \ell(f_i^t, y_t)\right).$$

Define the random variable $X \in [0, 1]$ by $\Pr\left[X = \ell(f_i^t, y_t)\right] = \frac{w_i^t}{W_t}$, $i = 1, \dots, N$. Then

$$\Phi_{t+1} - \Phi_t = \log \mathbb{E}_X\left[e^{-\eta X}\right] \le -\left(1 - e^{-\eta}\right) \mathbb{E}_X[X] \quad \text{(Lemma 16.5 with } s = -\eta\text{)}$$
$$= -\left(1 - e^{-\eta}\right) \sum_{i=1}^{N} \frac{w_i^t}{W_t}\, \ell(f_i^t, y_t) \le -\left(1 - e^{-\eta}\right) \ell\!\left(\sum_{i=1}^{N} \frac{w_i^t}{W_t} f_i^t,\; y_t\right) \quad \text{($\ell$ convex + Jensen's inequality)}$$
$$= -\left(1 - e^{-\eta}\right) \ell(\hat{y}_t, y_t).$$

Summing from $t = 1$ to $T$:

$$\Phi_{T+1} - \Phi_1 = \sum_{t=1}^{T} \left(\Phi_{t+1} - \Phi_t\right) \le -\left(1 - e^{-\eta}\right) \sum_{t=1}^{T} \ell(\hat{y}_t, y_t) = -\left(1 - e^{-\eta}\right) L_{EWA}^T. \tag{16.1}$$

By the nature of the Exponential Weights Algorithm and the definition of the potential function $\Phi_t$,

$$\Phi_1 = \log \sum_{i=1}^{N} w_i^1 = \log N; \tag{16.2}$$

$$\Phi_{T+1} = \log \sum_{i=1}^{N} w_i^{T+1} \ge \log w_i^{T+1} = -\eta L_i^T, \quad \text{for all } i. \tag{16.3}$$

Combining (16.1), (16.2), and (16.3), we have, for all $i$,

$$-\eta L_i^T - \log N \le -\left(1 - e^{-\eta}\right) L_{EWA}^T.$$

Since $\eta > 0$, we have $1 - e^{-\eta} > 0$; dividing both sides by $\left(1 - e^{-\eta}\right)$ and rearranging reveals the inequality in the theorem.
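Corollary 16.4 follows from Theorem 16.3 by choosing $\eta$ appropriately. The following derivation sketch uses the standard tuning from the analysis of the Hedge algorithm (Freund and Schapire); the specific choice of $\eta$ below is that standard one, an addition here rather than something stated above. Write $\beta = e^{-\eta}$ and $L^* = L_{i^*}^T$. One can check by differentiation that $\log(1/\beta) \le \frac{1 - \beta^2}{2\beta}$ for $\beta \in (0, 1]$, with equality at $\beta = 1$. Hence Theorem 16.3 gives

$$L_{EWA}^T \le \frac{L^* \log(1/\beta) + \log N}{1 - \beta} \le L^* \cdot \frac{1 + \beta}{2\beta} + \frac{\log N}{1 - \beta}.$$

Choosing $\beta = \left(1 + \sqrt{2 \log N / L^*}\right)^{-1}$, i.e., $\eta = \log\left(1 + \sqrt{2 \log N / L^*}\right)$, a direct computation gives

$$L^* \cdot \frac{1 + \beta}{2\beta} = L^* + \sqrt{\tfrac{1}{2} L^* \log N}, \qquad \frac{\log N}{1 - \beta} = \sqrt{\tfrac{1}{2} L^* \log N} + \log N,$$

and summing the two terms yields the bound in Corollary 16.4.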