Online Learning. Jordan Boyd-Graber. University of Colorado Boulder LECTURE 21. Slides adapted from Mohri


Motivation

- PAC learning: distribution fixed over time (training and test), IID assumption.
- Online learning: no distributional assumption; worst-case (adversarial) analysis; training and test are interleaved.
- Performance measures: mistake model, regret.

General Online Setting

For t = 1 to T:
- Get instance x_t ∈ X
- Predict ŷ_t ∈ Y
- Get true label y_t ∈ Y
- Incur loss L(ŷ_t, y_t)

Classification: Y = {0, 1}, L(y, y′) = |y′ − y|
Regression: Y ⊆ ℝ, L(y, y′) = (y′ − y)²

Objective: minimize the total loss Σ_t L(ŷ_t, y_t)
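To make the protocol concrete, here is a minimal Python sketch of the loop above (the Learner interface with predict/update methods is an illustrative assumption, not something defined on the slides):

```python
# Minimal sketch of the online protocol: predict, observe the label, incur loss.
def online_loop(learner, stream, loss):
    """stream yields (x_t, y_t) pairs; returns the total loss over all rounds."""
    total = 0.0
    for x_t, y_t in stream:
        y_hat = learner.predict(x_t)   # predict before seeing the true label
        total += loss(y_hat, y_t)      # incur loss L(y_hat, y_t)
        learner.update(x_t, y_t)       # the learner may adapt after every round
    return total

# The two losses from this slide:
zero_one_loss = lambda y_hat, y: float(y_hat != y)   # classification, Y = {0, 1}
squared_loss  = lambda y_hat, y: (y_hat - y) ** 2    # regression, Y ⊆ R
```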

Plan

- Experts
- Perceptron Algorithm
- Online Perceptron for Structure Learning

Prediction with Expert Advice

For t = 1 to T:
- Get instance x_t ∈ X and expert advice a_{t,i} ∈ Y for i ∈ [1, N]
- Predict ŷ_t ∈ Y
- Get true label y_t ∈ Y
- Incur loss L(ŷ_t, y_t)

Objective: minimize regret, i.e., the difference between our total loss and that of the best expert:

    Regret(T) = Σ_t L(ŷ_t, y_t) − min_i Σ_t L(a_{t,i}, y_t)    (1)

Mistake Bound Model

Define the maximum number of mistakes a learning algorithm L makes to learn a concept c, over any sequence of examples (until it is perfect):

    M_L(c) = max_{x_1, ..., x_T} |mistakes(L, c)|    (2)

For a concept class C, take the maximum over concepts c:

    M_L(C) = max_{c ∈ C} M_L(c)    (3)

In the expert-advice case, this assumes some expert matches the concept (realizable).

Halving Algorithm

    H_1 ← H
    for t ← 1 ... T do
        Receive x_t
        ŷ_t ← Majority(H_t, a_t, x_t)
        Receive y_t
        if ŷ_t ≠ y_t then
            H_{t+1} ← {a ∈ H_t : a(x_t) = y_t}
    return H_{T+1}

Algorithm 1: The Halving Algorithm (Mitchell, 1997)
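A short executable sketch of Algorithm 1, assuming the finite hypothesis set is given as a list of Python callables h(x) ∈ {0, 1} and the data are realizable:

```python
# Sketch of the Halving Algorithm: predict by majority vote of the surviving
# hypotheses; on a mistake, discard every hypothesis that was wrong.
def halving(hypotheses, rounds):
    """hypotheses: list of callables h(x) -> 0/1; rounds: iterable of (x_t, y_t)."""
    H = list(hypotheses)
    for x_t, y_t in rounds:
        votes = [h(x_t) for h in H]
        y_hat = int(2 * sum(votes) >= len(H))                 # majority vote, ties toward 1
        if y_hat != y_t:
            H = [h for h, v in zip(H, votes) if v == y_t]     # keep only consistent hypotheses
    return H
```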

Halving Algorithm Bound (Littlestone, 1998)

For a finite hypothesis set H:

    M_Halving(H) ≤ lg |H|    (4)

After each mistake, the hypothesis set is reduced by at least half.

Consider the optimal mistake bound opt(H). Then

    VC(H) ≤ opt(H) ≤ M_Halving(H) ≤ lg |H|    (5)

For a fully shattered set, an adversary can form a binary tree of mistakes with height VC(H).

What about the non-realizable case?

Weighted Majority (Littlestone and Warmuth, 1998)

    for i ← 1 ... N do
        w_{1,i} ← 1
    for t ← 1 ... T do
        Receive x_t
        ŷ_t ← 1[ Σ_{i: y_{t,i}=1} w_{t,i} ≥ Σ_{i: y_{t,i}=0} w_{t,i} ]
        Receive y_t
        if ŷ_t ≠ y_t then
            for i ← 1 ... N do
                if y_{t,i} ≠ y_t then
                    w_{t+1,i} ← β w_{t,i}
                else
                    w_{t+1,i} ← w_{t,i}
    return w_{T+1}

- Weights for every expert (y_{t,i} is expert i's prediction at time t)
- Classify in favor of the side with higher total weight (y ∈ {0, 1})
- Experts that are wrong get their weights decreased (β ∈ [0, 1])
- If you are right, your weight stays unchanged
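An executable sketch of the same algorithm, where advice[t] is the length-N vector of expert predictions y_{t,i} (the array names are illustrative):

```python
import numpy as np

# Weighted Majority sketch: vote with the heavier side; when the algorithm
# errs, multiply every wrong expert's weight by beta.
def weighted_majority(advice, labels, beta=0.5):
    """advice: (T, N) array of 0/1 expert predictions; labels: length-T array of 0/1."""
    T, N = advice.shape
    w = np.ones(N)
    mistakes = 0
    for a_t, y_t in zip(advice, labels):
        y_hat = int(w[a_t == 1].sum() >= w[a_t == 0].sum())
        if y_hat != y_t:
            mistakes += 1
            w[a_t != y_t] *= beta      # shrink the experts that were wrong this round
    return w, mistakes
```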

Weighted Majority

Let m_T be the number of mistakes made by Weighted Majority through time T, and m*_T the number of mistakes of the best expert through time T. Then

    m_T ≤ (log N + m*_T log(1/β)) / log(2 / (1 + β))    (6)

- Thus the mistake bound is O(log N) plus a constant factor times the best expert's mistakes
- The Halving algorithm is the special case β = 0
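For instance, plugging β = 1/2 into (6) gives m_T ≤ (log N + m*_T log 2) / log(4/3) ≈ 3.5 log N + 2.4 m*_T (natural logarithms): a logarithmic dependence on the number of experts plus a small constant factor times the best expert's mistakes.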

Proof: Potential Function

Define the potential function as the sum of all weights:

    Φ_t ≡ Σ_i w_{t,i}    (7)

We will sandwich Φ_t between an upper and a lower bound.

For any expert i, we have the lower bound

    Φ_t ≥ w_{t,i} = β^{m_{t,i}}    (8)

- Weights are nonnegative, so Σ_i w_{t,i} ≥ w_{t,i}
- Each error multiplicatively reduces expert i's weight by β

Proof: Potential Function (Upper Bound)

If the algorithm makes an error at round t, then

    Φ_{t+1} ≤ Φ_t / 2 + β Φ_t / 2 = [(1 + β) / 2] Φ_t    (9)

- At most half of the experts (by weight) were right
- At least half of the experts (by weight) were wrong

Initially the potential function sums all weights, which start at 1:

    Φ_1 = N    (10)

After m_T mistakes over T rounds:

    Φ_T ≤ N [(1 + β) / 2]^{m_T}    (11)

Weighted Majority Proof

Put the two inequalities together, using the best expert:

    β^{m*_T} ≤ Φ_T ≤ N [(1 + β) / 2]^{m_T}    (12)

Take the log of both sides:

    m*_T log β ≤ log N + m_T log[(1 + β) / 2]    (13)

Solve for m_T:

    m_T ≤ (log N + m*_T log(1/β)) / log(2 / (1 + β))    (14)

Weighted Majority Recap

- Simple algorithm
- No harsh assumptions (handles the non-realizable case)
- Bound depends on the best expert
- Downside: can take a long time to do well in the worst case (but okay in practice)
- Solution: randomization

Plan

- Experts
- Perceptron Algorithm (this section)
- Online Perceptron for Structure Learning

Perceptron Algorithm

- Online algorithm for classification
- Very similar to logistic regression (but 0/1 loss)
- But what can we prove?

Perceptron Algorithm

    w_1 ← 0
    for t ← 1 ... T do
        Receive x_t
        ŷ_t ← sgn(w_t · x_t)
        Receive y_t
        if ŷ_t ≠ y_t then
            w_{t+1} ← w_t + y_t x_t
        else
            w_{t+1} ← w_t
    return w_{T+1}

Algorithm 2: Perceptron Algorithm (Rosenblatt, 1958)
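A minimal NumPy version of Algorithm 2 (labels are assumed to be in {−1, +1}; a single pass corresponds to the online setting):

```python
import numpy as np

# Perceptron sketch: predict with sgn(w · x); on a mistake, add y_t * x_t to w.
def perceptron(X, y, passes=1):
    """X: (T, d) array of instances; y: length-T array of labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(passes):
        for x_t, y_t in zip(X, y):
            y_hat = 1.0 if w @ x_t >= 0 else -1.0   # sgn(w · x_t), ties broken as +1
            if y_hat != y_t:
                w = w + y_t * x_t                   # update only on mistakes
    return w
```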

Objective Function

The perceptron optimizes

    (1/T) Σ_t max(0, −y_t (w · x_t))    (15)

Convex but not differentiable.

Margin and Errors

- If there is a good margin ρ, you will converge quickly
- Whenever you see an error, you move the classifier to get that example right
- Convergence is only possible if the data are separable

How many errors does the Perceptron make?

If your data lie in a ball of radius R and there is a margin

    ρ ≤ y_t (v · x_t) / ‖v‖    (16)

for some v (and all t), then the number of mistakes is bounded by R² / ρ².

- The examples on which you make an error are support vectors
- Convergence can be slow for small margins
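A quick sanity check of the R²/ρ² bound on synthetic separable data (the data-generating choices here are made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Separable 2-D data with a known unit-norm separator v and margin at least 0.1.
v = np.array([1.0, 1.0]) / np.sqrt(2.0)
X = rng.uniform(-1.0, 1.0, size=(2000, 2))
keep = np.abs(X @ v) > 0.1
X, y = X[keep], np.sign(X[keep] @ v)

R = np.linalg.norm(X, axis=1).max()   # radius of the ball containing the data
rho = np.abs(X @ v).min()             # achieved margin with respect to v (||v|| = 1)

w, mistakes = np.zeros(2), 0
for x_t, y_t in zip(X, y):
    if y_t * (w @ x_t) <= 0:          # counts the w = 0 rounds as mistakes
        mistakes += 1
        w += y_t * x_t

print(f"perceptron mistakes: {mistakes}   bound R^2/rho^2: {R ** 2 / rho ** 2:.1f}")
```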

Plan

- Experts
- Perceptron Algorithm
- Online Perceptron for Structure Learning (this section)

Binary to Structure [figure]

Generic Perceptron

- The perceptron is the simplest machine learning algorithm
- Online learning: one example at a time
- Learning by doing:
  - find the best output under the current weights
  - update weights at mistakes

Structured Perceptron [figure]
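The figure for this slide is not in the transcription. As a sketch of the update it depicts, here is the standard structured perceptron step, assuming the caller supplies a joint feature function phi(x, y) and a decoder argmax_y(w, x) that returns the highest-scoring structure (both names are illustrative):

```python
import numpy as np

# Structured perceptron sketch: decode with the current weights; on a mistake,
# move the weights toward the gold structure's features and away from the
# predicted structure's features.
def structured_perceptron(data, phi, argmax_y, dim, epochs=5):
    """data: iterable of (x, y_gold); phi(x, y) -> np.ndarray of length dim."""
    w = np.zeros(dim)
    for _ in range(epochs):
        for x, y_gold in data:
            y_hat = argmax_y(w, x)                    # best output under current weights
            if y_hat != y_gold:
                w += phi(x, y_gold) - phi(x, y_hat)   # update weights at mistakes
    return w
```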

Perceptron Algorithm [figure]

POS Example [figure]

What must be true?

- Finding the highest-scoring structure must be really fast (you'll do it often)
- Requires some sort of dynamic programming algorithm
- For tagging: features must be local to y (but can be global to x)
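For tagging, the dynamic program is Viterbi decoding over tag bigrams. A sketch, assuming a score(prev_tag, tag, x, i) function whose features are local to the tag pair but may look at the whole input x (the interface is illustrative):

```python
import numpy as np

# Viterbi sketch: delta[i, k] is the best score of any tag prefix ending with
# tags[k] at position i; back[i, k] remembers the best previous tag.
def viterbi(x, tags, score):
    n, K = len(x), len(tags)
    delta = np.full((n, K), -np.inf)
    back = np.zeros((n, K), dtype=int)
    for k in range(K):
        delta[0, k] = score(None, tags[k], x, 0)
    for i in range(1, n):
        for k in range(K):
            cand = [delta[i - 1, j] + score(tags[j], tags[k], x, i) for j in range(K)]
            back[i, k] = int(np.argmax(cand))
            delta[i, k] = cand[back[i, k]]
    best = [int(np.argmax(delta[n - 1]))]   # best final tag
    for i in range(n - 1, 0, -1):           # follow back-pointers
        best.append(back[i, best[-1]])
    return [tags[k] for k in reversed(best)]
```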

Averaging is Good [figure]
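The plot for this slide is not in the transcription; the idea it illustrates is to return the average of all intermediate weight vectors rather than the final one. A binary-case sketch of that averaging:

```python
import numpy as np

# Averaged perceptron sketch: accumulate the weight vector after every example
# and return the average, which is typically more stable than the final weights.
def averaged_perceptron(X, y, epochs=5):
    """X: (T, d) array; y: labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    w_sum = np.zeros_like(w)
    n = 0
    for _ in range(epochs):
        for x_t, y_t in zip(X, y):
            if y_t * (w @ x_t) <= 0:
                w = w + y_t * x_t
            w_sum += w                 # accumulate after every example, not just mistakes
            n += 1
    return w_sum / n
```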

Smoothing

- Must include subset templates for features
- For example, if you have the feature template (t₀, w₀, w₁), you must also have (t₀, w₀); (t₀, w₁); (w₀, w₁)
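A tiny illustration of generating the subset templates (template names follow the slide's example; this version also emits the single-item templates, while the slide lists only the pairs):

```python
from itertools import combinations

# From a conjoined feature template, also emit all smaller sub-templates.
def subset_templates(template):
    return [c for r in range(1, len(template)) for c in combinations(template, r)]

print(subset_templates(("t0", "w0", "w1")))
# [('t0',), ('w0',), ('w1',), ('t0', 'w0'), ('t0', 'w1'), ('w0', 'w1')]
```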

Inexact Search?

- Sometimes search is too hard
- So we use beam search instead
- How to create algorithms that respect this relaxation: track when the right answer falls off the beam
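One common way to track when the right answer falls off the beam is the early-update variant of the structured perceptron; the sketch below is an illustration of that idea, with assumed helper functions extend(x, i) (possible next tags) and phi_prefix(x, prefix) (feature vector of a partial output):

```python
import numpy as np

# Beam search with early update: as soon as the gold prefix is no longer on
# the beam, stop and return an update direction computed from the prefixes.
def beam_early_update(w, x, gold, extend, phi_prefix, beam_size=8):
    beam = [()]
    for i in range(len(gold)):
        cands = [p + (t,) for p in beam for t in extend(x, i)]
        cands.sort(key=lambda p: float(w @ phi_prefix(x, p)), reverse=True)
        beam = cands[:beam_size]
        gold_prefix = tuple(gold[: i + 1])
        if gold_prefix not in beam:                   # gold fell off the beam
            return phi_prefix(x, gold_prefix) - phi_prefix(x, beam[0])
    if beam[0] != tuple(gold):                        # standard update on a full mistake
        return phi_prefix(x, tuple(gold)) - phi_prefix(x, beam[0])
    return None                                       # prediction was correct: no update
```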

Wrapup

- Structured prediction: when one label isn't enough
- Generative models can help when you don't have much data
- Discriminative models are state of the art
- More in Natural Language Processing (at least when I teach it)