Online Learning
Jordan Boyd-Graber
University of Colorado Boulder
Lecture 21
Slides adapted from Mohri

Motivation
- PAC learning: distribution fixed over time (training and test); IID assumption.
- Online learning: no distributional assumption; worst-case (adversarial) analysis; training and test are mixed.
- Performance measures: mistake model, regret.

General Online Setting
For t = 1 to T:
    receive instance x_t ∈ X
    predict ŷ_t ∈ Y
    receive true label y_t ∈ Y
    incur loss L(ŷ_t, y_t)
- Classification: Y = {0, 1}, L(y, y') = |y' - y|
- Regression: Y ⊆ R, L(y, y') = (y' - y)²
Objective: minimize the total loss \sum_t L(\hat{y}_t, y_t).
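This protocol translates directly into a small driver loop. A minimal sketch, assuming a learner object with predict/update methods (the interface is my own illustration, not from the slides):

```python
def online_learning(learner, stream, loss):
    """Run the generic online protocol from the slide.

    learner: object with predict(x) and update(x, y) methods
             (illustrative interface, not from the slides)
    stream:  iterable of (x_t, y_t) pairs
    loss:    function (y_hat, y) -> float
    Returns the total loss incurred over the stream.
    """
    total = 0.0
    for x_t, y_t in stream:
        y_hat = learner.predict(x_t)   # commit to a prediction first
        total += loss(y_hat, y_t)      # then the true label is revealed
        learner.update(x_t, y_t)       # learner may adapt for future rounds
    return total
```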

Plan
- Experts
- Perceptron Algorithm
- Online Perceptron for Structure Learning

Prediction with Expert Advice
For t = 1 to T:
    receive instance x_t ∈ X and advice a_{t,i} ∈ Y for each expert i ∈ [1, N]
    predict ŷ_t ∈ Y
    receive true label y_t ∈ Y
    incur loss L(ŷ_t, y_t)
Objective: minimize the regret, i.e., the difference between our total loss and the best expert's:

    \mathrm{Regret}(T) = \sum_t L(\hat{y}_t, y_t) - \min_i \sum_t L(a_{t,i}, y_t)    (1)

Mistake Bound Model
Define the maximum number of mistakes a learning algorithm L makes to learn a concept c over any sequence of examples (until it is perfect):

    M_L(c) = \max_{x_1, \dots, x_T} \mathrm{mistakes}(L, c)    (2)

For a concept class C, this is the max over concepts c:

    M_L(C) = \max_{c \in C} M_L(c)    (3)

In the expert-advice case, this assumes some expert matches the concept (the realizable case).

Halving Algorithm
    H_1 ← H
    for t ← 1 to T:
        receive x_t
        ŷ_t ← Majority(H_t, x_t)
        receive y_t
        if ŷ_t ≠ y_t:
            H_{t+1} ← {a ∈ H_t : a(x_t) = y_t}
    return H_{T+1}
Algorithm 1: The Halving Algorithm (Mitchell, 1997)
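A minimal Python sketch of the halving algorithm, assuming experts are callables x -> label and a realizable stream (the names are illustrative, not from the slides):

```python
def halving(experts, stream):
    """Halving algorithm: keep only the experts consistent so far.

    experts: list of callables x -> label (the hypothesis set H)
    stream:  iterable of (x_t, y_t) pairs
    Assumes the realizable case, so `alive` never becomes empty.
    """
    alive = list(experts)                          # H_1 = H
    mistakes = 0
    for x, y in stream:
        votes = [h(x) for h in alive]
        y_hat = max(set(votes), key=votes.count)   # majority vote
        if y_hat != y:
            mistakes += 1
            alive = [h for h in alive if h(x) == y]  # drop inconsistent experts
    return alive, mistakes
```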

Halving Algorithm Bound (Littlestone, 1988)
For a finite hypothesis set H,

    M_{\mathrm{Halving}}(H) \le \lg |H|    (4)

- After each mistake, the hypothesis set shrinks by at least half.
Consider the optimal mistake bound opt(H). Then

    \mathrm{VC}(H) \le \mathrm{opt}(H) \le M_{\mathrm{Halving}}(H) \le \lg |H|    (5)

- For a fully shattered set, an adversary can form a binary tree of mistakes with height VC(H).
What about the non-realizable case?

Weighted Majority (Littlestone and Warmuth, 1994)
    for i ← 1 to N:
        w_{1,i} ← 1
    for t ← 1 to T:
        receive x_t
        ŷ_t ← 1[ Σ_{i: y_{t,i}=1} w_{t,i} ≥ Σ_{i: y_{t,i}=0} w_{t,i} ]
        receive y_t
        if ŷ_t ≠ y_t:
            for i ← 1 to N:
                if y_{t,i} ≠ y_t:
                    w_{t+1,i} ← β w_{t,i}
                else:
                    w_{t+1,i} ← w_{t,i}
    return w_{T+1}
- A weight for every expert
- Classify in favor of the side with the higher total weight (y ∈ {0, 1})
- Experts that are wrong get their weights decreased (β ∈ [0, 1))
- Experts that are right keep their weights unchanged
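The same algorithm as a short NumPy sketch (my translation of the pseudocode above; the names are illustrative):

```python
import numpy as np

def weighted_majority(advice, labels, beta=0.5):
    """Weighted majority (Littlestone and Warmuth).

    advice: (T, N) array of expert predictions in {0, 1}
    labels: (T,) array of true labels in {0, 1}
    beta:   penalty in [0, 1) for experts that err on a mistake round
    Returns the final weights and the algorithm's mistake count.
    """
    T, N = advice.shape
    w = np.ones(N)
    mistakes = 0
    for t in range(T):
        ones = w[advice[t] == 1].sum()
        zeros = w[advice[t] == 0].sum()
        y_hat = 1 if ones >= zeros else 0
        if y_hat != labels[t]:
            mistakes += 1
            wrong = advice[t] != labels[t]
            w[wrong] *= beta   # shrink weights of experts that were wrong
    return w, mistakes
```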

Weighted Majority Bound
Let m_t be the number of mistakes made by weighted majority up to time t, and m_t^* the best expert's mistakes up to time t. Then

    m_t \le \frac{\log N + m_t^* \log(1/\beta)}{\log \frac{2}{1+\beta}}    (6)

- Thus the mistake bound is O(log N) plus a constant factor times the best expert's mistakes
- The halving algorithm is the special case β = 0

Proof: Potential Function
The potential function is the sum of all weights:

    \Phi_t \equiv \sum_i w_{t,i}    (7)

We'll create a sandwich of upper and lower bounds. For any expert i, we have the lower bound

    \Phi_t \ge w_{t,i} = \beta^{m_{t,i}}    (8)

- Weights are nonnegative, so \sum_i w_{t,i} \ge w_{t,i}
- Each error multiplicatively reduces the weight by β

Proof: Potential Function (Upper Bound)
If the algorithm makes an error at round t,

    \Phi_{t+1} \le \frac{\Phi_t}{2} + \beta \frac{\Phi_t}{2} = \left[ \frac{1+\beta}{2} \right] \Phi_t    (9)

- At most half of the experts by weight were right
- At least half of the experts by weight were wrong, and their weight is multiplied by β
Initially the potential function sums all the weights, which start at 1:

    \Phi_1 = N    (10)

After m_T mistakes over T rounds,

    \Phi_T \le \left[ \frac{1+\beta}{2} \right]^{m_T} N    (11)

Weighted Majority Proof
Put the two inequalities together, using the best expert:

    \beta^{m_T^*} \le \Phi_T \le \left[ \frac{1+\beta}{2} \right]^{m_T} N    (12)

Take the log of both sides:

    m_T^* \log \beta \le \log N + m_T \log \left[ \frac{1+\beta}{2} \right]    (13)

Solve for m_T:

    m_T \le \frac{\log N + m_T^* \log(1/\beta)}{\log \left[ \frac{2}{1+\beta} \right]}    (14)
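As a sanity check (mine, not from the slides), you can simulate weighted majority against synthetic experts and confirm the bound in Equation (14) holds, reusing the weighted_majority sketch above:

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, beta = 1000, 20, 0.5
labels = rng.integers(0, 2, size=T)

# Each synthetic expert flips the true label with its own error rate
error_rates = rng.uniform(0.1, 0.5, size=N)
flips = rng.random((T, N)) < error_rates
advice = np.where(flips, 1 - labels[:, None], labels[:, None])

w, m = weighted_majority(advice, labels, beta=beta)
m_star = (advice != labels[:, None]).sum(axis=0).min()   # best expert's mistakes
bound = (np.log(N) + m_star * np.log(1 / beta)) / np.log(2 / (1 + beta))
print(m, m_star, bound)   # m should never exceed bound
```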

Weighted Majority Recap
- Simple algorithm
- No harsh assumptions (handles the non-realizable case)
- Bound depends on the best expert
- Downside: can take a long time to do well in the worst case (but okay in practice)
- Solution: randomization

Plan
- Experts
- Perceptron Algorithm
- Online Perceptron for Structure Learning

Perceptron Algorithm
- Online algorithm for classification
- Very similar to logistic regression (but with 0/1 loss)
- But what can we prove?

Perceptron Algorithm
    w_1 ← 0
    for t ← 1 to T:
        receive x_t
        ŷ_t ← sgn(w_t · x_t)
        receive y_t
        if ŷ_t ≠ y_t:
            w_{t+1} ← w_t + y_t x_t
        else:
            w_{t+1} ← w_t
    return w_{T+1}
Algorithm 2: Perceptron Algorithm (Rosenblatt, 1958)
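A direct Python translation of the pseudocode (a sketch; the epochs parameter is my addition for replaying the stream):

```python
import numpy as np

def perceptron(X, y, epochs=1):
    """Online perceptron for labels in {-1, +1}.

    X: (T, d) array of instances; y: (T,) array of labels.
    sgn(0) is treated as +1 by convention.
    """
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_t, y_t in zip(X, y):
            y_hat = 1 if w @ x_t >= 0 else -1
            if y_hat != y_t:
                w = w + y_t * x_t   # move the separator toward the mistake
    return w
```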

Objective Function
The perceptron optimizes

    \frac{1}{T} \sum_t \max(0, -y_t (w \cdot x_t))    (15)

which is convex but not differentiable.

Margin and Errors
- If there's a good margin ρ, you'll converge quickly
- Whenever you see an error, you move the classifier to get it right
- Convergence is only possible if the data are separable

How many errors does the Perceptron make?
If your data lie in a ball of radius R and there is a margin

    \rho \le \frac{y_t (v \cdot x_t)}{\|v\|}    (16)

for some v, then the number of mistakes is bounded by R²/ρ².
- The places where you make an error are support vectors
- Convergence can be slow for small margins
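A quick numeric illustration (mine, not from the slides): generate separable data with a known margin, run one online pass of the perceptron, and compare the mistake count against R²/ρ²:

```python
import numpy as np

rng = np.random.default_rng(1)
v = np.array([1.0, -2.0])                  # an illustrative true separator
X = rng.normal(size=(1000, 2))
margin = np.abs(X @ v) / np.linalg.norm(v)
X = X[margin > 0.3]                        # keep only points with a clear margin
y = np.where(X @ v >= 0, 1, -1)

R = np.linalg.norm(X, axis=1).max()
rho = (y * (X @ v)).min() / np.linalg.norm(v)

w, mistakes = np.zeros(2), 0
for x_t, y_t in zip(X, y):                 # a single online pass
    if (1 if w @ x_t >= 0 else -1) != y_t:
        mistakes += 1
        w = w + y_t * x_t
print(mistakes, (R / rho) ** 2)            # mistakes should not exceed (R/rho)^2
```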

Plan
- Experts
- Perceptron Algorithm
- Online Perceptron for Structure Learning

Binary to Structure [figure-only slides]

Generic Perceptron
- The perceptron is the simplest machine learning algorithm
- Online learning: one example at a time
- Learning by doing (a code sketch follows):
    - find the best output under the current weights
    - update the weights at mistakes
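A hedged sketch of that update loop for structured outputs, assuming a joint feature function and an argmax decoder (both interfaces are my own illustration; the slides show this only as figures):

```python
def structured_perceptron(data, feats, argmax, epochs=5):
    """Structured perceptron sketch.

    data:   list of (x, y) pairs, where y is a structured output
            (e.g., a tag sequence)
    feats:  joint feature function feats(x, y) -> dict of feature counts
    argmax: decoder returning the highest-scoring structure under w
            (e.g., Viterbi for tagging); it runs once per example
    """
    w = {}
    for _ in range(epochs):
        for x, y in data:
            y_hat = argmax(x, w)
            if y_hat != y:
                # Promote the gold structure's features ...
                for k, v in feats(x, y).items():
                    w[k] = w.get(k, 0.0) + v
                # ... and demote the predicted structure's features
                for k, v in feats(x, y_hat).items():
                    w[k] = w.get(k, 0.0) - v
    return w
```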

Structured Perceptron [figure]

Perceptron Algorithm [figure]

POS Example [figure]

What must be true?
- Finding the highest-scoring structure must be really fast (you'll do it often)
- This requires some sort of dynamic programming algorithm
- For tagging: features must be local to y (but can be global to x)

Averaging is Good [figure-only slides]
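The figure-only slides argue for weight averaging. A common way to implement it (a sketch of the standard averaged variant, not code from the slides) is to return the running mean of the weight vector rather than its final value:

```python
import numpy as np

def averaged_perceptron(X, y, epochs=5):
    """Averaged perceptron: accumulate the weight vector after every
    example and return the mean, which typically generalizes better
    than the final weights."""
    w = np.zeros(X.shape[1])
    w_sum = np.zeros_like(w)
    for _ in range(epochs):
        for x_t, y_t in zip(X, y):
            if (1 if w @ x_t >= 0 else -1) != y_t:
                w = w + y_t * x_t
            w_sum += w          # accumulate after every example
    return w_sum / (epochs * len(X))
```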

Smoothing
- Must include subset templates for features
- For example, if you have the feature template (t_0, w_0, w_1), you must also have (t_0, w_0), (t_0, w_1), and (w_0, w_1)
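A tiny helper (illustrative, not from the slides) that enumerates the proper sub-templates of a feature template; the slide's example lists the two-element subsets:

```python
from itertools import combinations

def subset_templates(template):
    """All nonempty proper sub-templates of a feature template."""
    n = len(template)
    return [c for r in range(1, n) for c in combinations(template, r)]

print(subset_templates(("t0", "w0", "w1")))
# includes ('t0', 'w0'), ('t0', 'w1'), ('w0', 'w1'), plus singletons
```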

Inexact Search?
- Sometimes exact search is too hard
- So we use beam search instead
- To create algorithms that respect this relaxation: track when the right answer falls off the beam

Wrapup
- Structured prediction: when one label isn't enough
- Generative models can help when you don't have a lot of data
- Discriminative models are state of the art
- More in Natural Language Processing (at least when I teach it)
