Classification: Weighted Majority
Jordan Boyd-Graber, University of Maryland
Slides adapted from Mohri


Beyond Binary Classification
Before, we've talked about combining weak predictors (boosting).
What if you instead have strong predictors?
How do you make inherently binary algorithms multiclass?
How do you answer questions like ranking?

General Online Setting
For t = 1 to T:
  Get instance x_t ∈ X
  Predict ŷ_t ∈ Y
  Get true label y_t ∈ Y
  Incur loss L(ŷ_t, y_t)
Classification: Y = {0, 1}, L(y, y') = |y' − y|
Regression: Y ⊆ ℝ, L(y, y') = (y' − y)²
Objective: minimize the total loss  \sum_t L(\hat{y}_t, y_t)
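
To make the protocol concrete, here is a minimal Python sketch; the stream, predictor, and function names are illustrative and not part of the slides:

def zero_one_loss(y_hat, y):
    return abs(y_hat - y)            # classification: Y = {0, 1}

def squared_loss(y_hat, y):
    return (y_hat - y) ** 2          # regression

def online_loss(stream, predict, loss):
    """stream yields (x_t, y_t) pairs; predict maps an instance to a prediction."""
    total = 0.0
    for x_t, y_t in stream:
        y_hat_t = predict(x_t)       # commit to a prediction before seeing the label
        total += loss(y_hat_t, y_t)  # then observe y_t and incur the loss
    return total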

Prediction with Expert Advice
For t = 1 to T:
  Get instance x_t ∈ X and advice a_{t,i} ∈ Y for each expert i ∈ [1, N]
  Predict ŷ_t ∈ Y
  Get true label y_t ∈ Y
  Incur loss L(ŷ_t, y_t)
Objective: minimize regret, i.e., the difference between our total loss and that of the best expert:
  \text{Regret}(T) = \sum_t L(\hat{y}_t, y_t) - \min_i \sum_t L(a_{t,i}, y_t)   (1)
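
A small Python sketch of the regret in Eq. (1), assuming the advice is stored as a T × N table of expert predictions (the data layout and names are my own, for illustration):

def regret(predictions, advice, labels, loss):
    """predictions[t] = learner's y_hat_t; advice[t][i] = expert i's a_{t,i}; labels[t] = y_t."""
    T = len(labels)
    n_experts = len(advice[0])
    learner_loss = sum(loss(predictions[t], labels[t]) for t in range(T))
    best_expert_loss = min(
        sum(loss(advice[t][i], labels[t]) for t in range(T))
        for i in range(n_experts)
    )
    return learner_loss - best_expert_loss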

Mistake Bound Model
Define the maximum number of mistakes a learning algorithm L makes while learning a concept c, over any sequence of examples (i.e., until it predicts perfectly):
  M_L(c) = \max_{x_1, \dots, x_T} \text{mistakes}(L, c)   (2)
For a concept class C, take the max over concepts c:
  M_L(C) = \max_{c \in C} M_L(c)   (3)
In the expert-advice case, this assumes some expert matches the concept (the realizable setting).

Halving Algorithm
  H_1 ← H
  for t ← 1 ... T do
    Receive x_t
    ŷ_t ← Majority(H_t, a_t, x_t)
    Receive y_t
    if ŷ_t ≠ y_t then
      H_{t+1} ← {a ∈ H_t : a(x_t) = y_t}
  return H_{T+1}
Algorithm 1: The Halving Algorithm (Mitchell, 1997)
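
A minimal Python sketch of the Halving Algorithm above, assuming each hypothesis is a callable from instances to labels and that the realizable case holds (names and data layout are illustrative):

from collections import Counter

def halving(hypotheses, stream):
    """hypotheses: iterable of callables h(x) -> label; stream yields (x_t, y_t)."""
    H = set(hypotheses)
    for x_t, y_t in stream:
        votes = Counter(h(x_t) for h in H)
        y_hat = votes.most_common(1)[0][0]        # majority vote of surviving hypotheses
        if y_hat != y_t:
            # on a mistake, keep only hypotheses consistent with the true label;
            # in the realizable case H never becomes empty
            H = {h for h in H if h(x_t) == y_t}
    return H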

Halving Algorithm Bound (Littlestone, 1988)
For a finite hypothesis set H:
  M_{\text{Halving}}(H) \le \lg |H|   (4)
After each mistake, the hypothesis set shrinks by at least half (e.g., with |H| = 1024 the algorithm makes at most 10 mistakes).
Consider the optimal mistake bound opt(H). Then
  \text{VC}(H) \le \text{opt}(H) \le M_{\text{Halving}}(H) \le \lg |H|   (5)
For a fully shattered set, an adversary can form a binary tree of mistakes with height VC(H).
What about the non-realizable case?

Weighted Majority (Littlestone and Warmuth, 1994)
  for i ← 1 ... N do
    w_{1,i} ← 1
  for t ← 1 ... T do
    Receive x_t
    ŷ_t ← 1 if Σ_{i: a_{t,i}=1} w_{t,i} ≥ Σ_{i: a_{t,i}=0} w_{t,i}, else 0
    Receive y_t
    if ŷ_t ≠ y_t then
      for i ← 1 ... N do
        if a_{t,i} ≠ y_t then w_{t+1,i} ← β w_{t,i}
        else w_{t+1,i} ← w_{t,i}
  return w_{T+1}
Keep a weight for every expert, and predict the label y ∈ {0, 1} whose side has the higher total weight.
On rounds where the algorithm errs, experts that were wrong get their weights multiplied by β ∈ [0, 1).
If you're right, your weight stays unchanged.
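
A sketch of this update rule in Python; the advice layout and the default β are assumptions for illustration:

def weighted_majority(advice, labels, beta=0.5):
    """advice[t][i] in {0, 1} is expert i's prediction at round t; labels[t] is y_t."""
    n = len(advice[0])
    w = [1.0] * n                                  # one weight per expert, initialized to 1
    mistakes = 0
    for a_t, y_t in zip(advice, labels):
        vote_1 = sum(w[i] for i in range(n) if a_t[i] == 1)
        vote_0 = sum(w[i] for i in range(n) if a_t[i] == 0)
        y_hat = 1 if vote_1 >= vote_0 else 0       # weighted majority vote
        if y_hat != y_t:
            mistakes += 1
            # penalize every expert that was wrong on this (mistaken) round
            w = [w[i] * beta if a_t[i] != y_t else w[i] for i in range(n)]
    return w, mistakes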

Weighted Majority Bound
Let m_t be the number of mistakes made by Weighted Majority up to time t, m*_t the number of mistakes of the best expert up to time t, and N the number of experts. Then
  m_t \le \frac{\log N + m^*_t \log \frac{1}{\beta}}{\log \frac{2}{1+\beta}}   (6)
Thus the mistake bound is O(log N) plus a term proportional to the best expert's mistakes.
The Halving algorithm is the special case β = 0.
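
Plugging illustrative numbers into Eq. (6), say N = 100 experts, β = 1/2, and a best expert that makes m*_T = 20 mistakes (these values are just for illustration):

import math

N, beta, m_star = 100, 0.5, 20    # illustrative values, not from the slides
bound = (math.log(N) + m_star * math.log(1 / beta)) / math.log(2 / (1 + beta))
print(round(bound, 1))            # ~64.2, so Weighted Majority makes at most 64 mistakes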

Proof: Potential Function
The potential function is the sum of all weights:
  \Phi_t \equiv \sum_i w_{t,i}   (7)
We'll create a sandwich of upper and lower bounds.
For any expert i, we have the lower bound
  \Phi_t \ge w_{t,i} = \beta^{m_{t,i}}   (8)
Weights are nonnegative, so \Phi_t = \sum_i w_{t,i} \ge w_{t,i}.
Each error by expert i multiplicatively reduces its weight by β.

Proof: Potential Function (Upper Bound)
If the algorithm makes an error at round t, then
  \Phi_{t+1} \le \frac{\Phi_t}{2} + \beta \frac{\Phi_t}{2} = \frac{1+\beta}{2}\,\Phi_t   (9)
At most half of the experts by weight were right (their weights are kept), and at least half by weight were wrong (their weights are multiplied by β).
Initially the potential function sums all weights, which start at 1:
  \Phi_1 = N   (10)
After m_T mistakes over T rounds:
  \Phi_T \le \left(\frac{1+\beta}{2}\right)^{m_T} N   (11)

Weighted Majority Proof
Put the two inequalities together, using the best expert:
  \beta^{m^*_T} \le \Phi_T \le \left(\frac{1+\beta}{2}\right)^{m_T} N   (12)
Take the log of both sides:
  m^*_T \log \beta \le \log N + m_T \log \frac{1+\beta}{2}   (13)
Solve for m_T (note that \log \frac{1+\beta}{2} < 0, so dividing by it flips the inequality):
  m_T \le \frac{\log N + m^*_T \log \frac{1}{\beta}}{\log \frac{2}{1+\beta}}   (14)

Weighted Majority Recap
Simple algorithm
No harsh assumptions (works in the non-realizable case)
Mistake bound depends on the best expert
Downside: can take a long time to do well in the worst case (but okay in practice)
Solution: randomization
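
The randomization the last slide points to is usually the Randomized Weighted Majority idea: follow a single expert drawn with probability proportional to its weight, instead of taking a deterministic majority vote. A minimal sketch of that idea (this variant penalizes wrong experts on every round, which differs slightly from the update on the earlier slide):

import random

def randomized_weighted_majority(advice, labels, beta=0.5, seed=0):
    """Follow a randomly drawn expert, sampled in proportion to its weight."""
    rng = random.Random(seed)
    n = len(advice[0])
    w = [1.0] * n
    mistakes = 0                                     # realized mistakes for this random run
    for a_t, y_t in zip(advice, labels):
        i = rng.choices(range(n), weights=w)[0]      # sample an expert by weight
        y_hat = a_t[i]
        mistakes += int(y_hat != y_t)
        # penalize wrong experts every round (one standard variant)
        w = [w[j] * beta if a_t[j] != y_t else w[j] for j in range(n)]
    return w, mistakes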