Logistic regression and linear classifiers COMS 4771


1. Prediction functions (again)

Learning prediction functions

IID model for supervised learning: $(X_1, Y_1), \ldots, (X_n, Y_n), (X, Y)$ are iid random pairs (i.e., labeled examples).
- $X$ takes values in $\mathcal{X}$. E.g., $\mathcal{X} = \mathbb{R}^d$.
- $Y$ takes values in $\mathcal{Y}$. E.g., (regression problems) $\mathcal{Y} = \mathbb{R}$; (classification problems) $\mathcal{Y} = \{1, \ldots, K\}$ or $\mathcal{Y} = \{0, 1\}$ or $\mathcal{Y} = \{-1, +1\}$.

1. We observe $(X_1, Y_1), \ldots, (X_n, Y_n)$, and then choose a prediction function (i.e., predictor) $\hat{f} \colon \mathcal{X} \to \mathcal{Y}$. This is called learning or training.
2. At prediction time, we observe $X$ and form the prediction $\hat{f}(X)$.
3. The outcome is $Y$, and the squared loss is $(\hat{f}(X) - Y)^2$ (regression problems); the zero-one loss is $\mathbb{1}\{\hat{f}(X) \neq Y\}$ (classification problems).

Note: the expected zero-one loss is
$\mathbb{E}[\mathbb{1}\{\hat{f}(X) \neq Y\}] = \mathbb{P}(\hat{f}(X) \neq Y)$,
which we also call the error rate.

Distributions over labeled examples

- $\mathcal{X}$: space of possible side-information (feature space).
- $\mathcal{Y}$: space of possible outcomes (label space or output space).

The distribution $P$ of a random pair $(X, Y)$ taking values in $\mathcal{X} \times \mathcal{Y}$ can be thought of in two parts:
1. Marginal distribution $P_X$ of $X$: $P_X$ is a probability distribution on $\mathcal{X}$.
2. Conditional distribution $P_{Y \mid X = x}$ of $Y$ given $X = x$, for each $x \in \mathcal{X}$: $P_{Y \mid X = x}$ is a probability distribution on $\mathcal{Y}$.

Optimal classifier

For binary classification, what function $f \colon \mathcal{X} \to \{0, 1\}$ has the smallest risk (i.e., error rate) $R(f) := \mathbb{P}(f(X) \neq Y)$?

Conditional on $X = x$, the minimizer of the conditional risk $\hat{y} \mapsto \mathbb{P}(\hat{y} \neq Y \mid X = x)$ is
$\hat{y} := \begin{cases} 1 & \text{if } \mathbb{P}(Y = 1 \mid X = x) > 1/2, \\ 0 & \text{if } \mathbb{P}(Y = 1 \mid X = x) \le 1/2. \end{cases}$

Therefore, the function $f^* \colon \mathcal{X} \to \{0, 1\}$ with
$f^*(x) = \begin{cases} 1 & \text{if } \mathbb{P}(Y = 1 \mid X = x) > 1/2, \\ 0 & \text{if } \mathbb{P}(Y = 1 \mid X = x) \le 1/2, \end{cases} \quad x \in \mathcal{X},$
has the smallest risk. $f^*$ is called the Bayes (optimal) classifier.

For $\mathcal{Y} = \{1, \ldots, K\}$: $f^*(x) = \arg\max_{y \in \mathcal{Y}} \mathbb{P}(Y = y \mid X = x)$, $x \in \mathcal{X}$.
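Below is a minimal Python sketch of the plug-in rule above; the function eta is a hypothetical stand-in for the conditional probability $x \mapsto \mathbb{P}(Y = 1 \mid X = x)$, which the slides do not give us in closed form.

import math

def bayes_classifier(eta):
    # eta(x) is assumed to return P(Y = 1 | X = x).
    def f_star(x):
        return 1 if eta(x) > 0.5 else 0
    return f_star

# Example with a made-up eta on scalar inputs:
f = bayes_classifier(lambda x: 1.0 / (1.0 + math.exp(-x)))
print(f(2.0), f(-2.0))  # -> 1 0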

2. Logistic regression

Logistic regression

Suppose $\mathcal{X} = \mathbb{R}^d$ and $\mathcal{Y} = \{0, 1\}$. A logistic regression model is a statistical model in which the conditional probability function has a particular form:
$Y \mid X = x \sim \mathrm{Bern}(\eta_\beta(x))$, $x \in \mathbb{R}^d$,
with
$\eta_\beta(x) := \mathrm{logistic}(x^T \beta)$, $x \in \mathbb{R}^d$,
and
$\mathrm{logistic}(z) := \dfrac{1}{1 + e^{-z}} = \dfrac{e^z}{1 + e^z}$, $z \in \mathbb{R}$.

[Figure: plot of the logistic function on the interval $[-6, 6]$.]

Parameters: $\beta = (\beta_1, \ldots, \beta_d) \in \mathbb{R}^d$.

The conditional distribution of $Y$ given $X$ is Bernoulli; the marginal distribution of $X$ is not specified.
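As a quick illustration, here is a minimal Python sketch of the model's conditional probability function $\eta_\beta$; it assumes x and beta are plain lists of the same length d.

import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def eta(x, beta):
    # P(Y = 1 | X = x) under the logistic regression model with parameter beta.
    return logistic(sum(xi * bi for xi, bi in zip(x, beta)))

print(eta([1.0, -2.0], [0.5, 0.25]))  # Bernoulli parameter for this x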

MLE for logistic regression

Log-likelihood of $\beta$ in the iid logistic regression model, given data $(X_i, Y_i) = (x_i, y_i)$ for $i = 1, \ldots, n$:
$\ln \prod_{i=1}^n \eta_\beta(x_i)^{y_i} \bigl(1 - \eta_\beta(x_i)\bigr)^{1 - y_i} = \sum_{i=1}^n y_i \ln \eta_\beta(x_i) + (1 - y_i) \ln\bigl(1 - \eta_\beta(x_i)\bigr).$

There is no closed-form formula for the MLE. Nevertheless, there are efficient algorithms that obtain an approximate maximizer of the log-likelihood function, e.g., Newton's method.
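A minimal numerical sketch of maximizing this log-likelihood follows. The slides mention Newton's method; plain gradient ascent (whose gradient is $\sum_i (y_i - \eta_\beta(x_i)) x_i$) is used here only because it is shorter. The step size, iteration count, and the synthetic data are assumptions for illustration.

import numpy as np

def fit_logistic_regression(X, y, step=0.1, iters=2000):
    # X: (n, d) array of feature vectors; y: (n,) array of 0/1 labels.
    n, d = X.shape
    beta = np.zeros(d)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))   # eta_beta(x_i) for every i
        beta += step * (X.T @ (y - p)) / n    # gradient ascent on the log-likelihood
    return beta

# Tiny synthetic example (made-up data):
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X @ np.array([2.0, -1.0]) + rng.normal(size=200) > 0).astype(float)
print(fit_logistic_regression(X, y))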

Log-odds function and classifier

Equivalent way to characterize the logistic regression model: the log-odds function, given by
$\text{log-odds}_\beta(x) = \ln \dfrac{\eta_\beta(x)}{1 - \eta_\beta(x)} = \ln \dfrac{e^{x^T\beta} / (1 + e^{x^T\beta})}{1 / (1 + e^{x^T\beta})} = x^T\beta,$
is a linear function,¹ parameterized by $\beta \in \mathbb{R}^d$.

The Bayes optimal classifier $f_\beta \colon \mathbb{R}^d \to \{0, 1\}$ in the logistic regression model is
$f_\beta(x) = \begin{cases} 0 & \text{if } x^T\beta \le 0, \\ 1 & \text{if } x^T\beta > 0. \end{cases}$
Such classifiers are called linear classifiers.

¹ Some authors allow an affine function; we can get this using affine expansion.

3. Origin story for logistic regression

Where does the logistic regression model come from?

The following is one way the logistic regression model comes about (but not the only way).

Consider the following generative model for $(X, Y)$:
$Y \sim \mathrm{Bern}(\pi)$, $\quad X \mid Y = y \sim \mathrm{N}(\mu_y, \Sigma)$.
Parameters: $\pi \in [0, 1]$, $\mu_0, \mu_1 \in \mathbb{R}^d$, $\Sigma \in \mathbb{R}^{d \times d}$ symmetric and positive definite.

[Figure: the (unconditional) probability density function for $X$, a mixture of the two class-conditional Gaussians with $P(\omega_1) = P(\omega_2) = 0.5$, together with the decision regions $R_1$ and $R_2$.]

Statistical model for conditional distribution

Suppose we are given the following:
- $p_Y$: probability mass function for $Y$.
- $p_{X \mid Y = y}$: conditional probability density function for $X$ given $Y = y$.

What is the conditional distribution of $Y$ given $X$? By Bayes' rule, for any $x \in \mathbb{R}^d$,
$\mathbb{P}(Y = y \mid X = x) = \dfrac{p_Y(y)\, p_{X \mid Y = y}(x)}{p_X(x)}$
(where $p_X$ is the unconditional density for $X$). Therefore, the log-odds function is
$\text{log-odds}(x) = \ln\!\left(\dfrac{p_Y(1)}{p_Y(0)} \cdot \dfrac{p_{X \mid Y = 1}(x)}{p_{X \mid Y = 0}(x)}\right).$

Log-odds function for our toy model

Log-odds function:
$\text{log-odds}(x) = \ln\!\left(\dfrac{p_Y(1)}{p_Y(0)}\right) + \ln\!\left(\dfrac{p_{X \mid Y = 1}(x)}{p_{X \mid Y = 0}(x)}\right).$

In our toy model, we have $Y \sim \mathrm{Bern}(\pi)$ and $X \mid Y = y \sim \mathrm{N}(\mu_y, A A^T)$, so:
$\text{log-odds}(x) = \ln\dfrac{\pi}{1 - \pi} + \ln\dfrac{e^{-\frac{1}{2}\|A^{-1}(x - \mu_1)\|_2^2}}{e^{-\frac{1}{2}\|A^{-1}(x - \mu_0)\|_2^2}}$
$= \ln\dfrac{\pi}{1 - \pi} - \dfrac{1}{2}\|A^{-1}(x - \mu_1)\|_2^2 + \dfrac{1}{2}\|A^{-1}(x - \mu_0)\|_2^2$
$= \underbrace{\ln\dfrac{\pi}{1 - \pi} - \dfrac{1}{2}\bigl(\|A^{-1}\mu_1\|_2^2 - \|A^{-1}\mu_0\|_2^2\bigr)}_{\text{constant, doesn't depend on } x} + \underbrace{(\mu_1 - \mu_0)^T (A A^T)^{-1} x}_{\text{linear function of } x}.$

This is an affine function of $x$. Hence, the statistical model for $Y \mid X$ is a logistic regression model (with affine feature expansion).

Important: the logistic regression model forgets about $p_{X \mid Y = y}$!
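The derivation above maps directly to code. Here is a minimal sketch that, given the generative parameters $(\pi, \mu_0, \mu_1, \Sigma)$, computes the weight vector and intercept of the induced log-odds function $\text{log-odds}(x) = w^T x + b$; the example parameter values are made up.

import numpy as np

def gaussian_to_logistic(pi, mu0, mu1, Sigma):
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu0)                                   # linear part
    b = np.log(pi / (1 - pi)) - 0.5 * (mu1 @ Sigma_inv @ mu1      # constant part
                                       - mu0 @ Sigma_inv @ mu0)
    return w, b

w, b = gaussian_to_logistic(0.5, np.array([0.0, 0.0]), np.array([2.0, 0.0]), np.eye(2))
print(w, b)  # predict label 1 when w @ x + b > 0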

4. Linear classifiers

Linear classifiers

A linear classifier is specified by a weight vector $w \in \mathbb{R}^d$:
$f_w(x) := \begin{cases} -1 & \text{if } x^T w \le 0, \\ +1 & \text{if } x^T w > 0. \end{cases}$
(It will be notationally simpler to use $\{-1, +1\}$ instead of $\{0, 1\}$.)

Interpretation: does a linear combination of the input features exceed 0? For $w = (w_1, \ldots, w_d)$ and $x = (x_1, \ldots, x_d)$,
$x^T w = \sum_{i=1}^d w_i x_i \overset{?}{>} 0.$
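A minimal Python sketch of $f_w$, assuming numpy arrays for $x$ and $w$:

import numpy as np

def f_w(x, w):
    # Linear classifier: +1 if x^T w > 0, otherwise -1.
    return 1 if np.dot(x, w) > 0 else -1

print(f_w(np.array([1.0, -3.0]), np.array([2.0, 0.5])))  # -> 1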

Geometry of linear classifiers

[Figure: a hyperplane H in the $(x_1, x_2)$ plane with normal vector $w$.]

A hyperplane in $\mathbb{R}^d$ is a linear subspace of dimension $d - 1$.
- An $\mathbb{R}^2$-hyperplane is a line.
- An $\mathbb{R}^3$-hyperplane is a plane.
As a linear subspace, a hyperplane always contains the origin.

A hyperplane $H$ can be specified by a (non-zero) normal vector $w \in \mathbb{R}^d$. The hyperplane with normal vector $w$ is the set of points orthogonal to $w$:
$H = \{x \in \mathbb{R}^d : x^T w = 0\}.$

Classification with a hyperplane

[Figure: the hyperplane H, its normal vector w, the line span{w}, and a point x at angle θ from w.]

The projection of $x$ onto $\mathrm{span}\{w\}$ (a line) has coordinate $\|x\|_2 \cos\theta$, where
$\cos\theta = \dfrac{x^T w}{\|w\|_2 \|x\|_2}.$
(The distance to the hyperplane is $\|x\|_2 \cos\theta$.)

The decision boundary is the hyperplane (oriented by $w$):
$x^T w > 0 \iff \|x\|_2 \cos\theta > 0 \iff x$ is on the same side of $H$ as $w$.

What should we do if we want a hyperplane decision boundary that doesn't (necessarily) go through the origin?

Decision boundary with quadratic feature expansion

[Figures: an elliptical decision boundary and a hyperbolic decision boundary.]

The same feature expansions we saw for linear regression models can also be used here to upgrade linear classifiers.
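As a concrete sketch, here is one assumed quadratic feature map for $x \in \mathbb{R}^2$ (an affine term plus all degree-2 monomials); a linear classifier in the expanded space can produce elliptical or hyperbolic boundaries in the original space.

import numpy as np

def quadratic_expansion(x):
    # Map (x1, x2) to (1, x1, x2, x1^2, x2^2, x1*x2).
    x1, x2 = x
    return np.array([1.0, x1, x2, x1 * x1, x2 * x2, x1 * x2])

print(quadratic_expansion(np.array([0.5, -1.0])))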

Text features

How can we get features for text?

Classifying words (input = a sequence of characters):
- x_{starts with "anti"} = 1{starts with "anti"}
- x_{ends with "logy"} = 1{ends with "logy"}
- ... (same for all four-letter prefixes/suffixes)
- x_{length 3} = 1{length 3}
- x_{length 4} = 1{length 4}
- ... (same for all positive integers up to 20)

Classifying documents (input = a sequence of words):
- x_{contains "aardvark"} = 1{contains "aardvark"}
- ... (same for all words in the dictionary)
- x_{contains "each day"} = 1{contains "each day"}
- ... (same for all bi-grams of words in the dictionary)

Etc. We typically end up with lots of features!
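Below is a minimal sketch of the word-level features above as a Python dictionary. The particular prefixes and suffixes listed are assumed examples, and the length indicators are taken to be exact-length tests; the slides only say "length 3", "length 4", and so on.

def word_features(word, prefixes=("anti", "over"), suffixes=("logy", "tion")):
    feats = {}
    for p in prefixes:
        feats["starts_with_" + p] = int(word.startswith(p))
    for s in suffixes:
        feats["ends_with_" + s] = int(word.endswith(s))
    for k in range(1, 21):                      # lengths 1 through 20
        feats["length_%d" % k] = int(len(word) == k)
    return feats

print({k: v for k, v in word_features("anthropology").items() if v})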

Sparse representations

If $x$ has few non-zero entries, then it is better to use a sparse representation.

E.g., "see spot run". Sparse representation (e.g., using a hash table):
x = { "contains_see":1, "contains_spot":1, "contains_run":1, "contains_see_spot":1, "contains_spot_run":1 }

Compare to a dense representation (using an array):
x = [ 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, ... ]

To compute the inner product $w^T x$:
sum = 0
for feature in x:
    sum = sum + w[feature] * x[feature]
Running time ∝ number of non-zero entries of $x$.

If the $x$'s are sparse, would you expect there to be a good sparse linear classifier?
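A runnable version of the sparse inner-product loop above, assuming both the example x and the weight vector w are stored as dicts keyed by feature name:

def sparse_dot(w, x):
    total = 0.0
    for feature, value in x.items():
        total += w.get(feature, 0.0) * value   # missing weights count as zero
    return total

x = {"contains_see": 1, "contains_spot": 1, "contains_run": 1}
w = {"contains_spot": 0.7, "contains_run": -0.2}
print(sparse_dot(w, x))  # -> 0.5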

5. Learning linear classifiers

Learning linear classifiers

Even if the Bayes optimal classifier is not linear (in our chosen feature space), we can hope that it has a good linear approximation.

Goal: learn $\hat{w} \in \mathbb{R}^d$ using an iid sample $(X_1, Y_1), \ldots, (X_n, Y_n)$ such that
$\text{Excess risk}(f_{\hat{w}}) := \underbrace{R(f_{\hat{w}})}_{\text{risk of your classifier}} - \underbrace{\min_{w \in \mathbb{R}^d} R(f_w)}_{\text{risk of best linear classifier}}$
is as small as possible, where the loss function is zero-one loss.

Empirical risk minimization (ERM): find $f_w$ with minimum empirical risk
$\hat{R}(f_w) := \dfrac{1}{n} \sum_{i=1}^n \mathbb{1}\{f_w(X_i) \neq Y_i\}.$

Theorem: the ERM solution $f_{\hat{w}}$ satisfies
$\mathbb{E}\, R(f_{\hat{w}}) - \min_{w \in \mathbb{R}^d} R(f_w) \to 0$
as $n \to \infty$, at rate $\sqrt{d/n}$ (and sometimes even faster!).
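For concreteness, a minimal sketch of the empirical (zero-one) risk of a linear classifier $f_w$, assuming labels in $\{-1, +1\}$ and numpy arrays:

import numpy as np

def empirical_risk(w, X, y):
    # Fraction of training examples misclassified by f_w.
    predictions = np.where(X @ w > 0, 1, -1)
    return np.mean(predictions != y)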

ERM for zero-one loss

Unfortunately, we cannot compute the (zero-one loss) ERM in general. The following problem is NP-hard (i.e., "computationally intractable"):
- input: n labeled examples from $\mathbb{R}^d \times \{0, 1\}$ with the promise that there is a linear classifier with empirical risk at most 0.01.
- output: a linear classifier with empirical risk at most 0.49.
(Zero-one loss is very different from squared loss!)

Potential saving grace: real-world problems we need to solve do not look like the encodings of difficult 3-SAT instances.

6. Linearly separable data and Perceptron

Linearly separable data

Suppose there is a linear classifier that perfectly classifies the training examples $S := ((x_1, y_1), \ldots, (x_n, y_n))$, i.e., for some $w \in \mathbb{R}^d$,
$f_w(x) = y$ for all $(x, y) \in S$.
In this case, we say the training data is linearly separable.

[Figure: linearly separable data with a separating hyperplane and its normal vector $w$.]

Finding a linear separator

Problem: given training examples $S$ from $\mathbb{R}^d \times \{-1, +1\}$, determine whether or not there exists $w \in \mathbb{R}^d$ such that
$f_w(x) = y$ for all $(x, y) \in S$
(and find such a vector if one exists).

- $d$ variables: $w \in \mathbb{R}^d$.
- $|S|$ inequalities: for $(x, y) \in S$, if $y = -1$: $x^T w \le 0$; if $y = +1$: $x^T w > 0$.
This can be solved in polynomial time using algorithms for linear programming (see the sketch below).

If a separator exists, and the inequalities can be satisfied with some non-negligible "wiggle room", then there is a very simple algorithm that finds a solution.
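Here is a minimal sketch of the linear-programming route mentioned above, with scipy as an assumed dependency. It further assumes the data can be separated with wiggle room, so the strict inequalities can be replaced by $y_i x_i^T w \ge 1$; finding $w$ is then a pure feasibility problem (the objective is zero).

import numpy as np
from scipy.optimize import linprog

def find_separator(X, y):
    # Feasibility LP: find w with y_i * x_i^T w >= 1 for all i,
    # written as -y_i * x_i^T w <= -1 for linprog's A_ub form.
    n, d = X.shape
    A_ub = -(y[:, None] * X)
    b_ub = -np.ones(n)
    res = linprog(c=np.zeros(d), A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * d)
    return res.x if res.success else None

X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, -1.5]])
y = np.array([1, 1, -1])
print(find_separator(X, y))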

Perceptron (Rosenblatt, 1958)

Perceptron
input: labeled examples $S$ from $\mathbb{R}^d \times \{-1, +1\}$.
1: initialize ŵ_1 := 0.
2: for t = 1, 2, ... do
3:   if there is an example in S misclassified by f_{ŵ_t} then
4:     let (x_t, y_t) be any such misclassified example from S.
5:     update: ŵ_{t+1} := ŵ_t + y_t x_t.
6:   else
7:     return ŵ_t.
8:   end if
9: end for

Note 1: An example (x, y) is misclassified by f_w if y x^T w ≤ 0.
Note 2: If Perceptron terminates, then f_{ŵ_t} perfectly classifies the data!
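A minimal runnable sketch of this loop, assuming numpy arrays and linearly separable data (otherwise the while-loop below never exits, as discussed later in these slides):

import numpy as np

def perceptron(X, y):
    n, d = X.shape
    w = np.zeros(d)
    while True:
        mistakes = np.where(y * (X @ w) <= 0)[0]   # indices of misclassified examples
        if len(mistakes) == 0:
            return w                               # perfect classification of S
        t = mistakes[0]                            # any misclassified example will do
        w = w + y[t] * X[t]

X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, -1.5]])
y = np.array([1, 1, -1])
print(perceptron(X, y))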

Perceptron demo

Scenario 1: the current vector ŵ_t is comparable to x_t in length, and the example (x_t, y_t) with y_t = 1 is misclassified (x_t^T ŵ_t ≤ 0). The updated vector ŵ_{t+1} now correctly classifies (x_t, y_t).

Scenario 2: the current vector ŵ_t is much longer than x_t, and the example (x_t, y_t) with y_t = 1 is misclassified (x_t^T ŵ_t ≤ 0). The updated vector ŵ_{t+1} still does not correctly classify (x_t, y_t).

[Figures: the vectors x_t, ŵ_t, and ŵ_{t+1} in each scenario.]

It is not obvious that Perceptron will eventually terminate!

When is there a lot of wiggle room?

Suppose $w \in \mathbb{R}^d$ satisfies
$\min_{(x, y) \in S} y x^T w > 0.$
Then so does, e.g., $w / 100$. Let's fix a particular scaling of $w$.

Let $(\tilde{x}, \tilde{y})$ be any example in $S$ that achieves the minimum. Rescale $w$ so that $\tilde{y} \tilde{x}^T w = 1$. (Now the scaling is fixed.)

[Figure: the hyperplane $H$, the point $\tilde{y}\tilde{x}$, and its distance $\tilde{y}\tilde{x}^T w / \|w\|_2$ to $H$.]

Now the distance from $\tilde{y}\tilde{x}$ to $H$ is $1 / \|w\|_2$. This distance is called the margin.

Therefore, if $w$ is a short vector satisfying $\min_{(x, y) \in S} y x^T w = 1$, then it corresponds to a linear separator with large margin on the examples in $S$.

Theorem: Perceptron converges quickly when there is a short $w$ with $\min_{(x, y) \in S} y x^T w = 1$.

Perceptron convergence theorem

Theorem. Assume there exists $w \in \mathbb{R}^d$ such that $\min_{(x, y) \in S} y x^T w = 1$. Let $L := \max_{(x, y) \in S} \|x\|_2$. Then Perceptron halts after at most $\|w\|_2^2 L^2$ iterations.

Proof: Suppose Perceptron does not exit the for-loop in iteration $t$. Then there is a labeled example in $S$, which we call $(x_t, y_t)$, such that:
- $y_t w^T x_t \ge 1$ (by definition of $w$),
- $y_t \hat{w}_t^T x_t \le 0$ (since $f_{\hat{w}_t}$ misclassifies the example),
- $\hat{w}_{t+1} = \hat{w}_t + y_t x_t$ (since an update is made).

We use this information to (inductively) lower-bound
$\cos(\text{angle between } w \text{ and } \hat{w}_{t+1}) = \dfrac{w^T \hat{w}_{t+1}}{\|w\|_2 \|\hat{w}_{t+1}\|_2}.$

Proof of Perceptron convergence theorem

Goal: lower-bound $\dfrac{w^T \hat{w}_{t+1}}{\|w\|_2 \|\hat{w}_{t+1}\|_2}$.

Numerator:
$w^T \hat{w}_{t+1} = w^T(\hat{w}_t + y_t x_t) = w^T \hat{w}_t + y_t w^T x_t \ge w^T \hat{w}_t + 1.$
Therefore, by induction (with $\hat{w}_1 = 0$), $w^T \hat{w}_{t+1} \ge t$.

Denominator:
$\|\hat{w}_{t+1}\|_2^2 = \|\hat{w}_t + y_t x_t\|_2^2 = \|\hat{w}_t\|_2^2 + 2 y_t \hat{w}_t^T x_t + \|x_t\|_2^2 \le \|\hat{w}_t\|_2^2 + L^2.$
Therefore, by induction (with $\hat{w}_1 = 0$), $\|\hat{w}_{t+1}\|_2^2 \le L^2 t$.

Hence
$\cos(\text{angle between } w \text{ and } \hat{w}_{t+1}) = \dfrac{w^T \hat{w}_{t+1}}{\|w\|_2 \|\hat{w}_{t+1}\|_2} \ge \dfrac{t}{\|w\|_2 L \sqrt{t}}.$
Since cosine is at most one, we conclude $t \le \|w\|_2^2 L^2$.

Sneak peek: maximum margin linear classifier

We can obtain the linear separator with the largest margin by finding the shortest vector $w$ such that
$\min_{(x, y) \in S} y x^T w = 1.$

[Figure: the largest-margin separator compared with a separator of smaller margin.]

This can be expressed as a mathematical optimization problem that is the basis of support vector machines. (More on this later.)

7. Online Perceptron

Online Perceptron

If the data is not linearly separable, Perceptron runs forever! Alternative: consider each example once, then stop.

Online Perceptron
input: labeled examples $((x_i, y_i))_{i=1}^n$ from $\mathbb{R}^d \times \{-1, +1\}$.
1: initialize ŵ_1 := 0.
2: for t = 1, 2, ..., n do
3:   if y_t x_t^T ŵ_t ≤ 0 then
4:     ŵ_{t+1} := ŵ_t + y_t x_t.
5:   else
6:     ŵ_{t+1} := ŵ_t.
7:   end if
8: end for
9: return ŵ_{n+1}.

The final classifier f_{ŵ_{n+1}} is not necessarily a linear separator (even if one exists!).
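A minimal sketch of Online Perceptron in Python. It returns every intermediate weight vector ŵ_1, ..., ŵ_{n+1}, since the online-to-batch conversions discussed below combine them; the optional starting vector is an assumption used by the multi-pass variant sketched later.

import numpy as np

def online_perceptron(X, y, w_init=None):
    n, d = X.shape
    w = np.zeros(d) if w_init is None else w_init.copy()
    ws = [w.copy()]                          # w_1
    for t in range(n):
        if y[t] * (X[t] @ w) <= 0:           # mistake: update
            w = w + y[t] * X[t]
        ws.append(w.copy())                  # w_{t+1}
    return ws                                # ws[-1] is w_{n+1}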

Online learning

Online learning algorithms:
- Go through the examples (x_1, y_1), (x_2, y_2), ... one-by-one.
- Before seeing (x_t, y_t), the learner has a current classifier f̂_t in hand.
- Upon seeing x_t, the learner makes a prediction: ŷ_t := f̂_t(x_t).
- If ŷ_t ≠ y_t, then the prediction was a mistake.
- The learner then sees the correct label y_t and updates f̂_t to get f̂_{t+1}. Typically, the update is very cheap to compute.

Theorem: If there is a short $w$ with $\min_t y_t x_t^T w \ge 1$, then Online Perceptron makes a small number of mistakes.

Theorem: If there is a short $w$ that makes a small number of "hinge loss mistakes", then Online Perceptron makes a small number of mistakes. (See the reading assignment; we'll discuss hinge loss in an upcoming lecture.)

What good is a small mistake bound?

The sequence of classifiers f̂_1, f̂_2, ... is accurate (on average) in predicting the labels of the iid sequence (X_1, Y_1), (X_2, Y_2), ...

[Figure: each classifier f̂_t predicts ŷ_t = f̂_t(X_t), which is then compared with Y_t, for t = 1, ..., 10.]

We can combine the classifiers f̂_1, f̂_2, ... into a single, accurate classifier f̂. This is achieved via online-to-batch conversion.

Online-to-batch conversion

Run the online learning algorithm on the sequence of examples (x_1, y_1), ..., (x_n, y_n) (in random order) to produce a sequence of binary classifiers f̂_1, ..., f̂_{n+1} (each f̂_i : X → {±1}).

Final classifier: majority vote over the f̂_i's,
$\hat{f}(x) := \begin{cases} -1 & \text{if } \sum_{i=1}^{n+1} \hat{f}_i(x) \le 0, \\ +1 & \text{if } \sum_{i=1}^{n+1} \hat{f}_i(x) > 0. \end{cases}$

There are many variants of online-to-batch conversions that make sense!

Practical suggestions for online-to-batch variant

Run the online learning algorithm to make a few (e.g., two) passes over the training examples, each time in a different random order.
Pass 1: (x_2, y_2), (x_9, y_9), (x_6, y_6), (x_1, y_1), (x_3, y_3), (x_4, y_4), (x_10, y_10), (x_8, y_8), (x_7, y_7), (x_5, y_5)
Pass 2: (x_4, y_4), (x_5, y_5), (x_1, y_1), (x_2, y_2), (x_3, y_3), (x_10, y_10), (x_7, y_7), (x_9, y_9), (x_6, y_6), (x_8, y_8)
(Total of 2n rounds.)

For linear classifiers, instead of using majority vote to combine, just average the weight vectors:
$\hat{w} := \dfrac{1}{2n+1} \sum_{i=1}^{2n+1} \hat{w}_i.$

Don't use the weight vectors from the first pass through the data:
$\hat{w} := \dfrac{1}{n+1} \sum_{i=n+1}^{2n+1} \hat{w}_i.$
(The weight vectors at the start are rubbish anyway.)
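A minimal sketch of this two-pass averaged variant, reusing the online_perceptron sketch from earlier; it averages only the weight vectors produced during the second pass. The random seed is an assumption for reproducibility.

import numpy as np

def averaged_perceptron_two_pass(X, y, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    second_pass_ws = []
    for p in range(2):
        order = rng.permutation(n)                 # fresh random order each pass
        ws = online_perceptron(X[order], y[order], w_init=w)
        w = ws[-1]                                 # continue from the last weight vector
        if p == 1:
            second_pass_ws = ws                    # w_{n+1}, ..., w_{2n+1}
    return np.mean(second_pass_ws, axis=0)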

Example: restaurant review classification

Data: 1M reviews of Pittsburgh restaurants (from Yelp).
- Y = {at least 4 stars, below 4 stars} (66.2% have at least 4 (out of 5) stars).
- Reviews are represented as bags-of-words: e.g., x_great = # times "great" appears in the review. (d ≈ 200000 unique words.)

Results for Online Perceptron + affine expansion + online-to-batch variant:
- Training risk: 10.1%.
- Test risk (on 320123 held-out test examples): 10.5%.
- Running time: 1.5 minutes on a single Intel Xeon 2.30GHz CPU. (This omits the time to load the data, which was another 4 minutes from the raw text.)

Other approaches for learning linear (or affine) classifiers

- Naïve Bayes generative model MLE
- Gaussian generative model MLE (sometimes)
- Methods for linear regression (+ thresholding)
- Logistic regression, linear programming, Perceptron, Online Perceptron
- Support vector machines, etc. (Next lecture.)

Key takeaways

1. Logistic regression model.
2. Structure/geometry of linear classifiers.
3. Empirical risk minimization for linear classifiers and intractability.
4. Linear separability; two approaches to find a linear separator.
5. Online Perceptron; online-to-batch conversion.