Online Classification: Perceptron and Winnow

E0 370 Statistical Learning Theory    Lecture 18    Nov 8, 2011
Lecturer: Shivani Agarwal    Scribe: Shivani Agarwal

1 Introduction

In this lecture we will start to study the online learning setting that was discussed briefly in the first lecture. Unlike the batch setting we have studied so far, where one is given a sample or batch of training data and the goal is to learn from this data a model that can make accurate predictions in the future, in the online setting learning takes place in a sequence of trials: on each trial, the learner must make a prediction or take some action, each of which can potentially result in some loss, and the goal is to update the prediction/decision model at the end of each trial so as to minimize the total loss incurred over a sequence of such trials. Online learning is relevant for a variety of problems, including prediction problems (e.g. forecasting the weather the next day) and decision/allocation problems (e.g. investing in different stocks or mutual funds).

We will start by considering online supervised learning problems, where on each trial the learner receives an instance and must predict its label, following which the true label is revealed and a corresponding loss is incurred; as noted above, the goal of the learner is to minimize the total loss over a sequence of trials. We will focus in this lecture on online binary classification problems, and in the next lecture on online regression problems. We will then discuss online learning from experts, a framework that can be useful for both online supervised learning problems and online decision/allocation problems; we will analyze this framework in some detail in a couple of lectures, and will then conclude in the last lecture with a brief discussion of how online learning algorithms and their analyses can be transported back into the batch setting.

The basic online binary classification setting can be described as follows:

Online Binary Classification
    For t = 1, ..., T:
        Receive instance $x_t \in X$
        Predict $\hat{y}_t \in \{\pm 1\}$
        Incur loss $\ell(y_t, \hat{y}_t)$

The goal of a learning algorithm in this setting is to minimize the total loss incurred. Specifically, let $S = ((x_1, y_1), \ldots, (x_T, y_T))$. Then the cumulative loss of an algorithm $A$ on the trial sequence $S$ is given by

    $L^{\ell}_S[A] = \sum_{t=1}^{T} \ell(y_t, \hat{y}_t).$    (1)

The goal is to design algorithms with small cumulative loss on any trial sequence (or any trial sequence satisfying certain properties); the analysis here is therefore worst-case, rather than probabilistic as in the batch setting. For binary classification with zero-one loss $\ell^{0\text{-}1}$, the cumulative loss of an algorithm over a trial sequence $S$ corresponds to the number of prediction mistakes made by the algorithm on this sequence; bounds on the cumulative zero-one loss $L^{0\text{-}1}_S[A]$ are therefore termed mistake bounds. In the following, we will study two classical algorithms for online binary classification, namely the perceptron and winnow algorithms, and discuss the mistake bounds that can be derived for them.
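To make the protocol and the cumulative loss in Eq. (1) concrete, here is a minimal sketch in Python (not part of the original notes); the learner object and its predict/update methods are a hypothetical interface introduced purely for illustration.

    # Sketch of the online binary classification protocol (illustrative only).
    # `learner` is assumed to expose predict(x) -> {-1, +1} and update(x, y);
    # this interface is hypothetical, not something defined in the notes.

    def run_online(learner, trials):
        """Run `learner` on a trial sequence S = [(x_1, y_1), ..., (x_T, y_T)]
        and return its cumulative zero-one loss, i.e. the number of mistakes."""
        mistakes = 0
        for x_t, y_t in trials:
            y_hat = learner.predict(x_t)   # learner commits to a prediction
            if y_hat != y_t:               # zero-one loss on this trial
                mistakes += 1
            learner.update(x_t, y_t)       # true label revealed; model may change
        return mistakes

Both algorithms studied below fit this interface, and their mistake bounds are exactly bounds on the value returned by such a loop.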

2 Perceptron

In its basic form, the perceptron algorithm applies to Euclidean instance spaces $X \subseteq \mathbb{R}^n$, and maintains a linear classifier represented by a weight vector in such a space:

Algorithm Perceptron
    Initial weight vector $w_1 = \mathbf{0} \in \mathbb{R}^n$
    For t = 1, ..., T:
        Receive instance $x_t \in X \subseteq \mathbb{R}^n$
        Predict $\hat{y}_t = \mathrm{sign}(w_t \cdot x_t)$
        Incur loss $\ell^{0\text{-}1}(y_t, \hat{y}_t)$
        If $\hat{y}_t \ne y_t$:  $w_{t+1} = w_t + y_t x_t$
        Else:  $w_{t+1} = w_t$

Notice that the algorithm makes an update to its model (weight vector) only when there is a mistake in its prediction; online algorithms satisfying this property are said to be conservative. To get an intuitive feel for the algorithm, observe that if the true label $y_t$ on trial $t$ is $+1$ and the algorithm predicts $\hat{y}_t = -1$, then it means $w_t \cdot x_t < 0$; in order to improve the prediction on this example, the algorithm must increase the value of this dot product. Indeed, we have

    $w_{t+1} \cdot x_t = (w_t + x_t) \cdot x_t = w_t \cdot x_t + \|x_t\|_2^2 \ge w_t \cdot x_t.$

Similarly, it can be verified that when $y_t = -1$ and the algorithm predicts $\hat{y}_t = +1$, the update has the effect of decreasing the value of the dot product. Thus the updates make sense intuitively. More formally, one can prove the following classical mistake bound for the perceptron algorithm in the linearly separable case:

Theorem 2.1 (Perceptron Convergence Theorem; Block, 1962; Novikoff, 1962). Let $S = ((x_1, y_1), \ldots, (x_T, y_T)) \in (\mathbb{R}^n \times \{\pm 1\})^T$. Let $R = \max\{\|x_t\|_2 : t \in [T]\}$ and let $\gamma > 0$. Then for any $u \in \mathbb{R}^n$ such that $y_t (u \cdot x_t) \ge \gamma$ for all $t \in [T]$,

    $L^{0\text{-}1}_S[\mathrm{Perceptron}] \le \frac{R^2 \|u\|_2^2}{\gamma^2}.$

Proof. Denote $L^{0\text{-}1}_S[\mathrm{Perceptron}] = k$. Consider measuring the progress towards $u$ (or closeness to $u$) on each trial in terms of $w_t \cdot u$. For each trial $t$ on which there is a mistake, we have

    $w_{t+1} \cdot u = w_t \cdot u + y_t (x_t \cdot u) \ge w_t \cdot u + \gamma.$    (2)

For all other trials $t$, we have $w_{t+1} \cdot u = w_t \cdot u + 0$. Therefore summing over $t = 1, \ldots, T$ gives

    $w_{T+1} \cdot u \ge w_1 \cdot u + k\gamma.$    (3)

Noting that $w_1 = \mathbf{0}$ and using Cauchy-Schwarz, we have

    $k\gamma \le w_{T+1} \cdot u \le \|w_{T+1}\|_2 \|u\|_2.$    (4)

Now for each trial $t$ on which there is a mistake,

    $\|w_{t+1}\|_2^2 = \|w_t\|_2^2 + 2 y_t (w_t \cdot x_t) + \|x_t\|_2^2$    (5)
    $\le \|w_t\|_2^2 + R^2$, since $y_t (w_t \cdot x_t) \le 0$ for a mistake trial.    (6)

For all other trials $t$, $\|w_{t+1}\|_2^2 = \|w_t\|_2^2 + 0$. Therefore summing over $t = 1, \ldots, T$ and noting again that $w_1 = \mathbf{0}$, we get

    $\|w_{T+1}\|_2^2 \le k R^2.$    (7)

Substituting in Eq. (4) gives

    $k\gamma \le \sqrt{k R^2}\, \|u\|_2 = \sqrt{k}\, R \|u\|_2.$    (8)

Squaring both sides yields the result.
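The following is a short illustrative sketch of the perceptron algorithm above in Python with NumPy (not code from the original notes); it is a direct transcription of the pseudocode, with a conservative update made only on mistake trials, and it fits the hypothetical predict/update interface used in the protocol sketch of Section 1.

    import numpy as np

    class Perceptron:
        """Sketch of the linear perceptron algorithm of Section 2 (illustrative only)."""

        def __init__(self, n):
            self.w = np.zeros(n)                      # w_1 = 0 in R^n

        def predict(self, x):
            # y_hat_t = sign(w_t . x_t); ties at 0 are broken towards +1 here.
            return 1 if np.dot(self.w, x) >= 0 else -1

        def update(self, x, y):
            # Conservative update: the weight vector changes only on a mistake.
            if self.predict(x) != y:
                self.w = self.w + y * np.asarray(x)   # w_{t+1} = w_t + y_t x_t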

One can also show the following weaker mistake bound in the general (non-separable) case:

Theorem 2.2 (Freund and Schapire, 1999). Let $S = ((x_1, y_1), \ldots, (x_T, y_T)) \in (\mathbb{R}^n \times \{\pm 1\})^T$. Let $R = \max\{\|x_t\|_2 : t \in [T]\}$ and let $\gamma > 0$. Then for any $u \in \mathbb{R}^n$,

    $L^{0\text{-}1}_S[\mathrm{Perceptron}] \le \left(\frac{R \|u\|_2 + \sqrt{\sum_{t=1}^{T} \big(\max\{0,\, \gamma - y_t (u \cdot x_t)\}\big)^2}}{\gamma}\right)^2.$

Details of the proof can be found in [1]. We conclude our discussion of the perceptron algorithm by observing that the algorithm can be re-written so as to use only dot products between instances seen by the algorithm, which facilitates a natural extension to a kernel-based variant for arbitrary instance spaces $X$:

Algorithm Kernel Perceptron
    Kernel function $K : X \times X \to \mathbb{R}$
    For t = 1, ..., T:
        Receive instance $x_t \in X$
        Predict $\hat{y}_t = \mathrm{sign}\big(\sum_{r=1}^{t-1} \alpha_r K(x_r, x_t)\big)$
        Incur loss $\ell^{0\text{-}1}(y_t, \hat{y}_t)$
        If $\hat{y}_t \ne y_t$:  $\alpha_t = y_t$
        Else:  $\alpha_t = 0$

A mistake bound similar to that for the linear perceptron algorithm can be shown in this case too:

Theorem 2.3 (Kernel Perceptron Convergence Theorem). Let $S = ((x_1, y_1), \ldots, (x_T, y_T)) \in (X \times \{\pm 1\})^T$ and let $K : X \times X \to \mathbb{R}$ be a kernel function on $X$. Let $R = \max\{\sqrt{K(x_t, x_t)} : t \in [T]\}$ and let $\gamma > 0$. Then for any $u \in X$ such that $y_t K(u, x_t) \ge \gamma$ for all $t \in [T]$,

    $L^{0\text{-}1}_S[\mathrm{Perceptron}_K] \le \frac{R^2 K(u, u)}{\gamma^2}.$

We leave the proof details as an exercise.
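As an illustration of how the kernel perceptron uses only kernel evaluations, here is a minimal Python sketch (not from the notes); the rbf_kernel function and its bandwidth parameter are arbitrary choices included only so the example is self-contained.

    import numpy as np

    class KernelPerceptron:
        """Sketch of the kernel perceptron of Section 2 (illustrative only)."""

        def __init__(self, kernel):
            self.kernel = kernel     # K : X x X -> R
            self.alphas = []         # coefficients alpha_r for instances seen so far
            self.instances = []      # instances x_r seen so far

        def predict(self, x):
            # y_hat_t = sign( sum_{r < t} alpha_r K(x_r, x_t) )
            s = sum(a * self.kernel(xr, x) for a, xr in zip(self.alphas, self.instances))
            return 1 if s >= 0 else -1

        def update(self, x, y):
            # alpha_t = y_t on a mistake trial, 0 otherwise.
            self.alphas.append(y if self.predict(x) != y else 0)
            self.instances.append(x)

    def rbf_kernel(a, b, bandwidth=1.0):
        # A Gaussian (RBF) kernel, chosen here only to make the sketch runnable.
        diff = np.asarray(a) - np.asarray(b)
        return np.exp(-np.dot(diff, diff) / (2.0 * bandwidth ** 2))

In practice one would store only the mistake trials (those with $\alpha_r \ne 0$), since the remaining terms contribute nothing to the prediction.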

3 Winnow

The winnow algorithm also maintains a linear classifier in a Euclidean instance space; in this case, however, the updates to the weight vector are multiplicative rather than additive:

Algorithm Winnow
    Learning rate parameter $\eta > 0$
    Initial weight vector $w_1 = (\frac{1}{n}, \ldots, \frac{1}{n}) \in \mathbb{R}^n$
    For t = 1, ..., T:
        Receive instance $x_t \in X \subseteq \mathbb{R}^n$
        Predict $\hat{y}_t = \mathrm{sign}(w_t \cdot x_t)$
        Incur loss $\ell^{0\text{-}1}(y_t, \hat{y}_t)$
        If $\hat{y}_t \ne y_t$:  $\forall i \in [n]$: $w_{t+1,i} = \frac{w_{t,i} \exp(\eta y_t x_{t,i})}{Z_t}$, where $Z_t = \sum_{j=1}^{n} w_{t,j} \exp(\eta y_t x_{t,j})$
        Else:  $w_{t+1} = w_t$

Here too, one can observe that when a mistake is made on some trial $t$, the effect of the update is to move the dot product $w_{t+1} \cdot x_t$ in the right direction compared to $w_t \cdot x_t$. Formally, we have the following mistake bound for trial sequences that are linearly separable by a non-negative weight vector:

Theorem 3.1. Let $S = ((x_1, y_1), \ldots, (x_T, y_T)) \in (\mathbb{R}^n \times \{\pm 1\})^T$. Let $R_\infty = \max\{\|x_t\|_\infty : t \in [T]\}$ and let $\gamma > 0$. Then for any $u \in \mathbb{R}^n$ such that $u_i \ge 0$ for all $i \in [n]$ and $y_t (u \cdot x_t) \ge \gamma$ for all $t \in [T]$,

    $L^{0\text{-}1}_S[\mathrm{Winnow}(\eta)] \le \frac{\ln n}{\frac{\eta\gamma}{\|u\|_1} - \ln\!\big(\frac{e^{\eta R_\infty} + e^{-\eta R_\infty}}{2}\big)}.$

Moreover, if $R_\infty$, $\|u\|_1$, and $\gamma$ are known, then one can select $\eta$ to yield

    $L^{0\text{-}1}_S[\mathrm{Winnow}(\eta^*)] \le \frac{2 R_\infty^2 \|u\|_1^2}{\gamma^2} \ln n.$

Proof. Denote $L^{0\text{-}1}_S[\mathrm{Winnow}(\eta)] = k$, and let $p = u / \|u\|_1$ so that $p \in \Delta_n$, where $\Delta_n$ is the probability simplex in $\mathbb{R}^n$. Consider again measuring the progress towards $u$ (or $p$) on each trial; in this case, we will measure the distance of $w_t$ from $p$ in terms of the KL-divergence, $\mathrm{KL}(p \,\|\, w_t) = \sum_{i=1}^{n} p_i \ln \frac{p_i}{w_{t,i}}$. For each trial $t$ on which there is a mistake, we have

    $\mathrm{KL}(p \,\|\, w_t) - \mathrm{KL}(p \,\|\, w_{t+1}) = \sum_{i=1}^{n} p_i \ln \frac{w_{t+1,i}}{w_{t,i}}$    (9)
    $= \sum_{i=1}^{n} p_i \ln \frac{\exp(\eta y_t x_{t,i})}{Z_t}$    (10)
    $= \eta y_t (p \cdot x_t) - \sum_{i=1}^{n} p_i \ln Z_t$    (11)
    $= \eta y_t (p \cdot x_t) - \ln Z_t$    (12)
    $\ge \frac{\eta\gamma}{\|u\|_1} - \ln Z_t.$    (13)

Now, $Z_t = \sum_{i=1}^{n} w_{t,i}\, e^{\eta y_t x_{t,i}}$. Noting that $y_t x_{t,i} \in [-R_\infty, R_\infty]$ for all $i, t$, we can bound $Z_t$ as follows using convexity of the mapping $z \mapsto e^{\eta z}$:

    $Z_t \le \sum_{i=1}^{n} w_{t,i} \left[\frac{1 + y_t x_{t,i}/R_\infty}{2}\, e^{\eta R_\infty} + \frac{1 - y_t x_{t,i}/R_\infty}{2}\, e^{-\eta R_\infty}\right]$    (14)
    $= \frac{e^{\eta R_\infty} + e^{-\eta R_\infty}}{2} \sum_{i=1}^{n} w_{t,i} + \frac{e^{\eta R_\infty} - e^{-\eta R_\infty}}{2 R_\infty}\, y_t \sum_{i=1}^{n} w_{t,i} x_{t,i}$    (15)
    $= \frac{e^{\eta R_\infty} + e^{-\eta R_\infty}}{2} + \frac{e^{\eta R_\infty} - e^{-\eta R_\infty}}{2 R_\infty}\, y_t (w_t \cdot x_t)$    (16)
    $\le \frac{e^{\eta R_\infty} + e^{-\eta R_\infty}}{2}$, since $e^{\eta R_\infty} - e^{-\eta R_\infty} > 0$ and $y_t (w_t \cdot x_t) \le 0$ for mistake trials $t$    (17)

(in going from (15) to (16) we use the fact that $\sum_{i=1}^{n} w_{t,i} = 1$, since $w_1$ is uniform and every update renormalizes by $Z_t$). Therefore, on each mistake trial $t$, we have

    $\mathrm{KL}(p \,\|\, w_t) - \mathrm{KL}(p \,\|\, w_{t+1}) \ge \frac{\eta\gamma}{\|u\|_1} - \ln \frac{e^{\eta R_\infty} + e^{-\eta R_\infty}}{2}.$    (18)

On all other trials $t$,

    $\mathrm{KL}(p \,\|\, w_t) - \mathrm{KL}(p \,\|\, w_{t+1}) = 0.$    (19)

Therefore summing over $t = 1, \ldots, T$, we have

    $\mathrm{KL}(p \,\|\, w_1) - \mathrm{KL}(p \,\|\, w_{T+1}) \ge k \left(\frac{\eta\gamma}{\|u\|_1} - \ln \frac{e^{\eta R_\infty} + e^{-\eta R_\infty}}{2}\right).$    (20)

Now,

    $\mathrm{KL}(p \,\|\, w_1) \le \ln n$ (since $w_1$ is uniform),    (21)
    $\mathrm{KL}(p \,\|\, w_{T+1}) \ge 0.$    (22)

This yields the desired bound

    $k \le \frac{\ln n}{\frac{\eta\gamma}{\|u\|_1} - \ln\!\big(\frac{e^{\eta R_\infty} + e^{-\eta R_\infty}}{2}\big)}.$    (23)

Now if $R_\infty$, $\|u\|_1$, and $\gamma$ are known, then one can minimize the right hand side above w.r.t. $\eta$; this yields

    $\eta^* = \frac{1}{2 R_\infty} \ln\!\left(\frac{1 + \gamma/(R_\infty \|u\|_1)}{1 - \gamma/(R_\infty \|u\|_1)}\right).$    (24)

With this choice of $\eta$, one gets

    $k \le \frac{\ln n}{g\big(\gamma/(R_\infty \|u\|_1)\big)},$    (25)

where $g(\epsilon) = \frac{1+\epsilon}{2} \ln(1+\epsilon) + \frac{1-\epsilon}{2} \ln(1-\epsilon)$ (note that $\gamma/(R_\infty \|u\|_1) \le 1$, since $y_t (u \cdot x_t) \le \|u\|_1 R_\infty$). One can show that $g(\epsilon) \ge \epsilon^2/2$, which when applied to the above yields the desired result.
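For completeness, here is a matching Python sketch of the winnow update (illustrative only, not code from the notes); like the perceptron sketch above, it updates the weights only on mistake trials and fits the same hypothetical predict/update interface.

    import numpy as np

    class Winnow:
        """Sketch of the winnow algorithm of Section 3 (illustrative only)."""

        def __init__(self, n, eta):
            self.eta = eta                            # learning rate eta > 0
            self.w = np.full(n, 1.0 / n)              # w_1 = (1/n, ..., 1/n)

        def predict(self, x):
            # y_hat_t = sign(w_t . x_t); ties at 0 are broken towards +1 here.
            return 1 if np.dot(self.w, x) >= 0 else -1

        def update(self, x, y):
            # Multiplicative update, performed only when a mistake is made.
            if self.predict(x) != y:
                w = self.w * np.exp(self.eta * y * np.asarray(x))  # w_{t,i} exp(eta y_t x_{t,i})
                self.w = w / w.sum()                               # normalize by Z_t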

4 Comparison of the Two Algorithms

To understand the relative strengths of the two algorithms, consider the following two examples, where $k \ll n$:

Example 1 (Sparse target vector, dense instances). Let $u \in \{0, 1\}^n$ with at most $k$ non-zero components, and let $x_t \in \{\pm 1\}^n$ for all $t$. Thus $\|u\|_1 = k$, $\|u\|_2 = \sqrt{k}$, $R_2 = \sqrt{n}$, and $R_\infty = 1$.

Example 2 (Dense target vector, sparse instances). Let $u = \mathbf{1} \in \mathbb{R}^n$ (the all-ones vector), and let $x_t \in \{0, 1\}^n$ for all $t$ such that each $x_t$ has at most $k$ non-zero components. Thus $\|u\|_1 = n$, $\|u\|_2 = \sqrt{n}$, $R_2 = \sqrt{k}$, and $R_\infty = 1$.

In Example 1 (taking margin $\gamma = 1$), the mistake bound we get for the perceptron is $R_2^2 \|u\|_2^2 / \gamma^2 = nk$, while that for winnow is $2 R_\infty^2 \|u\|_1^2 \ln n / \gamma^2 = 2k^2 \ln n$. On the other hand, in Example 2 (again with $\gamma = 1$), the mistake bound we get for the perceptron is $kn$, whereas that for winnow is $2n^2 \ln n$. Thus, for a sparse target vector that depends on only a small number of relevant features, winnow gives a better mistake bound; for dense target vectors and sparse instances, the perceptron has a better bound.

5 Next Lecture

In the next lecture we will see both additive and multiplicative update algorithms for online regression, and will derive bounds on their regret, which measures the cumulative loss of the algorithm with respect to the best possible loss within some class of predictors.

Acknowledgments. The proof of the mistake bound for winnow is based on a proof described by Sham Kakade and Ambuj Tewari in their lecture notes for a course taught at TTI Chicago in Spring 2008.

References

[1] Yoav Freund and Robert E. Schapire. Large margin classification using the perceptron algorithm. Machine Learning, 37(3):277-296, 1999.