
Efficient and Principled Online Classification Algorithms for Lifelong Learning
Francesco Orabona, Toyota Technological Institute at Chicago, Chicago, IL, USA
Talk @ Lifelong Learning for Mobile Robotics Applications Workshop, IROS 2012, Vila Moura, Portugal

Something about me
PhD in Robotics at LIRA-Lab, University of Genova
PostDocs in Machine Learning
I like theoretically motivated algorithms... but they must work well too! ;-)

Lifelong Learning: why? In standard Machine Learning we always have two steps: train on some data, then stop training and use the system on some other data. This approach is known to be doomed to fail: the environment changes continuously over time and it is impossible to predict how.

Lifelong Learning: how? Continuous adaptation and learning is the only possibility! My contributions to the field:
learning with bounded memory on infinite samples (ICML08, JMLR09)
transfer learning, to bootstrap new classifiers (ICRA09, CVPR10, BMVC12, IEEE Trans. Robotics "soon")
selective sampling, to know when to ask for more data (ICML09, ICML11)


Outline 1 2 3


Cafeteria, and I am really sure about it

Corridor, but I might be wrong. Please give me some feedback

No! It's Office.

Thank you, I will update my classifier

Outline 1 2 3

Selective sampling is a well-known semi-supervised online learning setting (Cohn et al., 1990). At each step t = 1, 2, ... the learner receives an instance x_t ∈ R^d and outputs a binary prediction for the associated unknown label y_t ∈ {−1, +1}. After each prediction the learner may observe the label y_t only by issuing a query. If no query is issued at time t, then y_t remains unknown. Since one expects the learner's performance to improve if more labels are observed, our goal is to trade off predictive accuracy against the number of queries.
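A minimal sketch of this protocol in code (my own illustration; the learner interface predict / wants_label / update is hypothetical, not from the slides):

def selective_sampling_run(learner, stream):
    """stream yields (x_t, y_t); y_t is shown to the learner only if it queries."""
    mistakes, queries = 0, 0
    for x_t, y_t in stream:
        y_hat = learner.predict(x_t)      # binary prediction in {-1, +1}
        mistakes += int(y_hat != y_t)     # counted for offline evaluation only
        if learner.wants_label(x_t):      # the learner decides whether to query
            queries += 1
            learner.update(x_t, y_t)      # label revealed, learner updates
        # if no query is issued, y_t remains unknown to the learner
    return mistakes, queries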

Selective Sampling in the Real World. We want to solve the problem in the real world. This means: efficient algorithms; no unrealistic assumptions; no parameters to tune to obtain convergence and/or good performance; and the order of the samples must be allowed to be adversarial (no i.i.d. assumption on the instances).

Warming up: the Coin toss. Somebody asks us to estimate the probability of obtaining head on a loaded coin. We can toss it N times and estimate the probability as (number of heads)/N. How do we choose N?
If we toss the coin at least (1/(2ε²)) log(2/δ) times, then with probability at least 1 − δ the true probability will be in [(number of heads)/N − ε, (number of heads)/N + ε].
Example: ε = 0.1, probability = 95% ⇒ N ≥ 35.
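A small sketch of this computation (my own illustration, using the two-sided Hoeffding bound written above; the constants behind the slide's example may differ):

import math
import random

def hoeffding_sample_size(epsilon: float, delta: float) -> int:
    # two-sided Hoeffding bound: N >= log(2 / delta) / (2 * epsilon^2)
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))

def estimate_bias(true_p: float, epsilon: float, delta: float) -> float:
    n = hoeffding_sample_size(epsilon, delta)
    heads = sum(random.random() < true_p for _ in range(n))
    return heads / n   # within +/- epsilon of true_p with probability >= 1 - delta

# usage: how many tosses for a +/- 0.05 estimate at 95% confidence?
print(hoeffding_sample_size(0.05, 0.05))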

Why do we care about coin tosses? Suppose there is a conditional probability P(Y | X = x_t). Each time we receive a sample, its label is given by a coin toss. By asking for the label of the same point many times, we could estimate it.
Main idea: assume that the function P(Y | X = x) has some regularity, so that close points have similar probabilities. Gathering points in a neighborhood will increase the confidence in the estimate for all the points in it!
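As a toy illustration of the neighborhood idea (my own sketch, assuming Gaussian weights; this is not the algorithm used in the talk):

import numpy as np

def neighborhood_estimate(x, X_obs, y_obs, bandwidth=0.1):
    """X_obs: (n, d) points whose labels were observed; y_obs: labels in {-1, +1}."""
    if len(X_obs) == 0:
        return 0.5                                  # no information yet
    d2 = np.sum((np.asarray(X_obs) - x) ** 2, axis=1)
    w = np.exp(-d2 / (2.0 * bandwidth ** 2))        # closer points weigh more
    avg = float(np.dot(w, y_obs) / np.sum(w))       # weighted average in [-1, 1]
    return (avg + 1.0) / 2.0                        # estimate of P(Y = +1 | X = x)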

A Step Further: Coins on a Line
[Figures: predicted and true coin probabilities as a function of X, after 5, about 100, about 300, and about 1200 coin tosses, and for an unknown coin.]

An easy case. Very easy assumption on P(Y | X = x): it is linear in x. Of course this model is a bit restrictive, but... we can use the kernel trick! P(Y | X = x) = ⟨u, φ(x)⟩.
Is this model powerful enough? Yes, if P(Y | X = x) is continuous and the kernel is universal (e.g. the Gaussian kernel). This implies that it is possible to approximate P(Y | X = x_t) with ⟨u, φ(x_t)⟩.
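A minimal sketch of fitting such a model from queried labels (my own illustration: kernel regularized least squares with a Gaussian kernel, in dual variables; the function names are mine):

import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    d2 = np.sum(A ** 2, axis=1)[:, None] + np.sum(B ** 2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * d2)

def fit_krls(X, y, gamma=1.0, reg=1.0):
    """X: (n, d) queried instances, y: labels in {-1, +1}. Returns dual coefficients."""
    K = gaussian_kernel(X, X, gamma)
    return np.linalg.solve(K + reg * np.eye(len(X)), y)

def predict_krls(alpha, X_train, x, gamma=1.0):
    # the estimate <u, phi(x)> expressed in dual variables
    return float(gaussian_kernel(x[None, :], X_train, gamma) @ alpha)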

To summarize: assume that the outputs y_t come from a distribution P(Y | X = x) that is continuous w.r.t. x, and assume a universal kernel is used, e.g. the Gaussian kernel. Note that the order of the x_t is adversarial, i.e. the samples can arrive in any possible order. These hypotheses are enough to solve our problem, and the algorithm is even efficient :-)

The Bound on Bias Query (BBQ) Algorithm
Parameter: κ
for t = 1, 2, ..., T do
  Receive new instance x_t
  Predict ŷ_t = sign(⟨w, x_t⟩)
  r_t = x_t^⊤ (I + A_t + x_t x_t^⊤)^{-1} x_t
  if r_t ≥ t^{-κ} then
    Query label y_t
    Update w with a Regularized Least Squares step
    A_{t+1} = A_t + x_t x_t^⊤
  end if
end for
Every time a query is not issued, the predicted output is not too far from the correct one.
N. Cesa-Bianchi, C. Gentile and F. Orabona. Robust Bounds for Classification via Selective Sampling. ICML 2009.
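A sketch of the loop above in code (my own illustration; the stream interface returning (x, get_label) is hypothetical, and the regularized least-squares update is written in primal form without kernels):

import numpy as np

def bbq(stream, d, kappa=0.5):
    """stream yields (x, get_label); get_label() returns y in {-1, +1} on demand."""
    A = np.zeros((d, d))   # A_t: sum of x x^T over queried steps
    b = np.zeros(d)        # sum of y * x over queried steps
    w = np.zeros(d)
    for t, (x, get_label) in enumerate(stream, start=1):
        y_hat = 1.0 if w @ x >= 0 else -1.0
        # r_t = x^T (I + A_t + x x^T)^{-1} x, a bound on bias/variance of the estimate
        r = x @ np.linalg.solve(np.eye(d) + A + np.outer(x, x), x)
        if r >= t ** (-kappa):          # estimate not reliable enough -> query the label
            y = get_label()
            A += np.outer(x, x)
            b += y * x
            w = np.linalg.solve(np.eye(d) + A, b)   # regularized least squares update
        yield y_hat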

How Does It Work? The RLS prediction is an estimate of the true probability, with a certain bias and variance. BBQ does not issue a query when r_t is small, because r_t is proportional to a common upper bound on the bias and the variance of the current RLS estimate. Hence, when r_t is small the learner can safely avoid issuing a query on that step, because it knows its prediction will be close enough to the true probability.

Theoretical Analysis. How do we measure the performance of this algorithm? We use a relative comparison: how good is it w.r.t. the best classifier ever? The best classifier is the one trained with the knowledge of all the samples, using the same kernel. We define the cumulative regret after T steps as R_T = Σ_{t=1}^T ( P(y_t Δ̂_t < 0) − P(y_t Δ_t < 0) ), where Δ̂_t is the margin of the algorithm and Δ_t = ⟨u, φ(x_t)⟩ is the margin of the best classifier.

Regret Bound of BBQ (2011 Version)
Theorem. If BBQ is run with input 0 < κ < 1, then the number of queried labels is N_T ≤ T^κ log|A_T|, and its cumulative regret R_T, with probability at least 1 − δ, is less than
min_{0<ε<1} [ ε T_ε + O( (‖u‖² + log(N_T/δ))/ε · log|A_T| ) + O( (‖u‖/ε)^{2/κ} ) ],
where T_ε = |{1 ≤ t ≤ T : |Δ_t| < ε}| and A_T = Σ_{queried t} x_t x_t^⊤.
In words, the regret is less than
min_{0<ε<1} [ Regret of the classifier that queries all the labels + O( (‖u‖/ε)^{2/κ} ) ].
F. Orabona, N. Cesa-Bianchi. Better Algorithms for Selective Sampling. ICML 2011.

r_t does not depend on the labels, hence the query condition ignores all the labels. Probably we are losing precious information.
Why is the margin important? Suppose ⟨w_t, x_t⟩ − c_t > 0: then we know that ⟨u, x_t⟩ ≥ ⟨w_t, x_t⟩ − c_t > 0.
[Figure: confidence interval [⟨w_t, x_t⟩ − c_t, ⟨w_t, x_t⟩ + c_t] around the prediction ⟨w_t, x_t⟩, lying entirely above 0.]
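Spelled out (my own filling-in of the step above, assuming that c_t is a high-probability confidence width for the RLS estimate, i.e. |⟨w_t, x_t⟩ − ⟨u, x_t⟩| ≤ c_t):

\[
\langle w_t, x_t\rangle - c_t > 0
\;\Longrightarrow\;
\langle u, x_t\rangle \ge \langle w_t, x_t\rangle - c_t > 0
\;\Longrightarrow\;
\mathrm{sign}(\langle u, x_t\rangle) = \mathrm{sign}(\langle w_t, x_t\rangle),
\]

so on such steps the sign of the optimal predictor is already determined and the label carries little extra information.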

Algorithm
Parameters: α > 0, 0 < δ < 1. Initialization: vector w = 0, matrix A_1 = I.
for each time step t = 1, 2, ... do
  Observe instance x_t and set Δ̂_t = ⟨w, x_t⟩
  Predict the label with SGN(Δ̂_t)
  if Δ̂_t² ≤ 2α (x_t^⊤ A_t^{-1} x_t) (4 Σ_{s=1}^{t-1} Z_s r_s + 36 ln(t/δ)) log t then
    Query label y_t
    w ← w − SGN(Δ̂_t) (|Δ̂_t| − 1)_+ A_t^{-1} x_t / (x_t^⊤ A_t^{-1} x_t)
    A_{t+1} = A_t + x_t x_t^⊤ and r_t = x_t^⊤ A_{t+1}^{-1} x_t
    Update w with a Regularized Least Squares step
  else
    A_{t+1} = A_t, r_t = 0
  end if
end for
(Z_s ∈ {0, 1} indicates whether a query was issued at step s.)
Kernels can be used by formulating the algorithm in dual variables.
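A simplified sketch of a margin-based query rule in the same spirit (my own illustration; the threshold here is deliberately simpler than the exact condition on the slide, and the shrinking step is omitted):

import numpy as np

def margin_based_sampler(stream, d, alpha=1.0, delta=0.1):
    """stream yields (x, get_label); get_label() returns y in {-1, +1} on demand."""
    A = np.eye(d)          # regularizer I plus sum of x x^T over queried steps
    b = np.zeros(d)
    w = np.zeros(d)
    for t, (x, get_label) in enumerate(stream, start=1):
        margin = w @ x
        y_hat = 1.0 if margin >= 0 else -1.0
        conf = x @ np.linalg.solve(A, x)                    # x^T A_t^{-1} x
        threshold = 2.0 * alpha * conf * np.log(t / delta)  # simplified confidence width
        if margin ** 2 <= threshold:                        # margin not confident -> query
            y = get_label()
            A += np.outer(x, x)
            b += y * x
            w = np.linalg.solve(A, b)                       # regularized least squares
        yield y_hat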

Guarantees
Theorem. After any number T of steps, with probability at least 1 − δ, the cumulative regret R_T of the algorithm run with α > 0 is less than
min_{0<ε<1} ( ε T_ε + O( (‖u‖² + ln|A_{T+1}| + ln(T/δ)) / ε ) ),
and the number N_T of queries is less than
min_{0<ε<1} ( T_ε + O( ln|A_{T+1}| + (e^{‖u‖²+1}/α) [ ‖u‖² + (1 + α ln T)(ln|A_{T+1}| + ln(T/δ)) / ε² ] ) ).
F. Orabona, N. Cesa-Bianchi. Better Algorithms for Selective Sampling. ICML 2011.

Outline 1 2 3

Synthetic Experiment. 10,000 random examples on the unit circle in R^100, with ‖x_t‖ = 1. The labels were generated according to our noise model. Our algorithm has the best trade-off between performance and number of queries.

Real World Experiments. Adult dataset with 32,000 samples and a Gaussian kernel.

Summary. Real world problems often deviate from the standard training/test IID setting. Theory can (and should?) be used to model real world problems under real world hypotheses. In particular, continuity of the conditional distribution P(Y | X = x) and universal kernels allow us to design efficient and principled algorithms to solve the selective sampling problem.
What problems can you solve with these tools? :-)

Thanks for your attention!
Code: http://dogma.sourceforge.net
My website: http://francesco.orabona.com
Minimal Bibliography:
N. Cesa-Bianchi, C. Gentile, F. Orabona. Robust Bounds for Classification via Selective Sampling. ICML 2009.
O. Dekel, C. Gentile, K. Sridharan. Robust Selective Sampling from Single and Multiple Teachers. COLT 2010.
F. Orabona, N. Cesa-Bianchi. Better Algorithms for Selective Sampling. ICML 2011.