Logistic regression and linear classifiers COMS 4771


1. Prediction functions (again)

Learning prediction functions

IID model for supervised learning: $(X_1, Y_1), \ldots, (X_n, Y_n), (X, Y)$ are iid random pairs (i.e., labeled examples).
- $X$ takes values in $\mathcal{X}$. E.g., $\mathcal{X} = \mathbb{R}^d$.
- $Y$ takes values in $\mathcal{Y}$. E.g., (regression problems) $\mathcal{Y} = \mathbb{R}$; (classification problems) $\mathcal{Y} = \{1, \ldots, K\}$ or $\mathcal{Y} = \{0, 1\}$ or $\mathcal{Y} = \{-1, +1\}$.

1. We observe $(X_1, Y_1), \ldots, (X_n, Y_n)$, and then choose a prediction function (i.e., predictor) $\hat{f} \colon \mathcal{X} \to \mathcal{Y}$. This is called learning or training.
2. At prediction time, we observe $X$ and form the prediction $\hat{f}(X)$.
3. The outcome is $Y$, and the squared loss is $(\hat{f}(X) - Y)^2$ (regression problems); the zero-one loss is $\mathbb{1}\{\hat{f}(X) \neq Y\}$ (classification problems).

Note: the expected zero-one loss is
$\mathbb{E}[\mathbb{1}\{\hat{f}(X) \neq Y\}] = \mathbb{P}(\hat{f}(X) \neq Y)$,
which we also call the error rate.

Distributions over labeled examples

- $\mathcal{X}$: space of possible side-information (feature space).
- $\mathcal{Y}$: space of possible outcomes (label space or output space).

The distribution $P$ of a random pair $(X, Y)$ taking values in $\mathcal{X} \times \mathcal{Y}$ can be thought of in two parts:
1. Marginal distribution $P_X$ of $X$: $P_X$ is a probability distribution on $\mathcal{X}$.
2. Conditional distribution $P_{Y \mid X = x}$ of $Y$ given $X = x$, for each $x \in \mathcal{X}$: $P_{Y \mid X = x}$ is a probability distribution on $\mathcal{Y}$.

Optimal classifier

For binary classification, what function $f \colon \mathcal{X} \to \{0, 1\}$ has the smallest risk (i.e., error rate) $R(f) := \mathbb{P}(f(X) \neq Y)$?

Conditional on $X = x$, the minimizer of the conditional risk $\hat{y} \mapsto \mathbb{P}(\hat{y} \neq Y \mid X = x)$ is
$\hat{y} := \begin{cases} 1 & \text{if } \mathbb{P}(Y = 1 \mid X = x) > 1/2, \\ 0 & \text{if } \mathbb{P}(Y = 1 \mid X = x) \le 1/2. \end{cases}$

Therefore, the function $f^* \colon \mathcal{X} \to \{0, 1\}$ with
$f^*(x) = \begin{cases} 1 & \text{if } \mathbb{P}(Y = 1 \mid X = x) > 1/2, \\ 0 & \text{if } \mathbb{P}(Y = 1 \mid X = x) \le 1/2, \end{cases} \quad x \in \mathcal{X},$
has the smallest risk. $f^*$ is called the Bayes (optimal) classifier.

For $\mathcal{Y} = \{1, \ldots, K\}$: $f^*(x) = \arg\max_{y \in \mathcal{Y}} \mathbb{P}(Y = y \mid X = x)$, $x \in \mathcal{X}$.
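Below is a minimal Python sketch of the plug-in rule above; the function eta is a hypothetical stand-in for the conditional probability $x \mapsto \mathbb{P}(Y = 1 \mid X = x)$, which the slides do not give us in closed form.

import math

def bayes_classifier(eta):
    # eta(x) is assumed to return P(Y = 1 | X = x).
    def f_star(x):
        return 1 if eta(x) > 0.5 else 0
    return f_star

# Example with a made-up eta on scalar inputs:
f = bayes_classifier(lambda x: 1.0 / (1.0 + math.exp(-x)))
print(f(2.0), f(-2.0))  # -> 1 0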

2. Logistic regression

Logistic regression

Suppose $\mathcal{X} = \mathbb{R}^d$ and $\mathcal{Y} = \{0, 1\}$. A logistic regression model is a statistical model in which the conditional probability function has a particular form:
$Y \mid X = x \sim \mathrm{Bern}(\eta_\beta(x))$, $x \in \mathbb{R}^d$,
with
$\eta_\beta(x) := \mathrm{logistic}(x^T \beta)$, $x \in \mathbb{R}^d$,
and
$\mathrm{logistic}(z) := \dfrac{1}{1 + e^{-z}} = \dfrac{e^z}{1 + e^z}$, $z \in \mathbb{R}$.

[Figure: plot of the logistic function on the interval $[-6, 6]$.]

Parameters: $\beta = (\beta_1, \ldots, \beta_d) \in \mathbb{R}^d$.

The conditional distribution of $Y$ given $X$ is Bernoulli; the marginal distribution of $X$ is not specified.
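As a quick illustration, here is a minimal Python sketch of the model's conditional probability function $\eta_\beta$; it assumes x and beta are plain lists of the same length d.

import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def eta(x, beta):
    # P(Y = 1 | X = x) under the logistic regression model with parameter beta.
    return logistic(sum(xi * bi for xi, bi in zip(x, beta)))

print(eta([1.0, -2.0], [0.5, 0.25]))  # Bernoulli parameter for this x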

MLE for logistic regression

Log-likelihood of $\beta$ in the iid logistic regression model, given data $(X_i, Y_i) = (x_i, y_i)$ for $i = 1, \ldots, n$:
$\ln \prod_{i=1}^n \eta_\beta(x_i)^{y_i} \bigl(1 - \eta_\beta(x_i)\bigr)^{1 - y_i} = \sum_{i=1}^n y_i \ln \eta_\beta(x_i) + (1 - y_i) \ln\bigl(1 - \eta_\beta(x_i)\bigr).$

There is no closed-form formula for the MLE. Nevertheless, there are efficient algorithms that obtain an approximate maximizer of the log-likelihood function, e.g., Newton's method.
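A minimal numerical sketch of maximizing this log-likelihood follows. The slides mention Newton's method; plain gradient ascent (whose gradient is $\sum_i (y_i - \eta_\beta(x_i)) x_i$) is used here only because it is shorter. The step size, iteration count, and the synthetic data are assumptions for illustration.

import numpy as np

def fit_logistic_regression(X, y, step=0.1, iters=2000):
    # X: (n, d) array of feature vectors; y: (n,) array of 0/1 labels.
    n, d = X.shape
    beta = np.zeros(d)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))   # eta_beta(x_i) for every i
        beta += step * (X.T @ (y - p)) / n    # gradient ascent on the log-likelihood
    return beta

# Tiny synthetic example (made-up data):
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X @ np.array([2.0, -1.0]) + rng.normal(size=200) > 0).astype(float)
print(fit_logistic_regression(X, y))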

Log-odds function and classifier

Equivalent way to characterize the logistic regression model: the log-odds function, given by
$\text{log-odds}_\beta(x) = \ln \dfrac{\eta_\beta(x)}{1 - \eta_\beta(x)} = \ln \dfrac{e^{x^T\beta} / (1 + e^{x^T\beta})}{1 / (1 + e^{x^T\beta})} = x^T\beta,$
is a linear function,¹ parameterized by $\beta \in \mathbb{R}^d$.

The Bayes optimal classifier $f_\beta \colon \mathbb{R}^d \to \{0, 1\}$ in the logistic regression model is
$f_\beta(x) = \begin{cases} 0 & \text{if } x^T\beta \le 0, \\ 1 & \text{if } x^T\beta > 0. \end{cases}$
Such classifiers are called linear classifiers.

¹ Some authors allow an affine function; we can get this using affine expansion.

3. Origin story for logistic regression

Where does the logistic regression model come from?

The following is one way the logistic regression model comes about (but not the only way).

Consider the following generative model for $(X, Y)$:
$Y \sim \mathrm{Bern}(\pi)$, $\quad X \mid Y = y \sim \mathrm{N}(\mu_y, \Sigma)$.
Parameters: $\pi \in [0, 1]$, $\mu_0, \mu_1 \in \mathbb{R}^d$, $\Sigma \in \mathbb{R}^{d \times d}$ symmetric and positive definite.

[Figure: the (unconditional) probability density function for $X$, a mixture of the two class-conditional Gaussians with $P(\omega_1) = P(\omega_2) = 0.5$, together with the decision regions $R_1$ and $R_2$.]

Statistical model for conditional distribution

Suppose we are given the following:
- $p_Y$: probability mass function for $Y$.
- $p_{X \mid Y = y}$: conditional probability density function for $X$ given $Y = y$.

What is the conditional distribution of $Y$ given $X$? By Bayes' rule, for any $x \in \mathbb{R}^d$,
$\mathbb{P}(Y = y \mid X = x) = \dfrac{p_Y(y)\, p_{X \mid Y = y}(x)}{p_X(x)}$
(where $p_X$ is the unconditional density for $X$). Therefore, the log-odds function is
$\text{log-odds}(x) = \ln\!\left(\dfrac{p_Y(1)}{p_Y(0)} \cdot \dfrac{p_{X \mid Y = 1}(x)}{p_{X \mid Y = 0}(x)}\right).$

Log-odds function for our toy model

Log-odds function:
$\text{log-odds}(x) = \ln\!\left(\dfrac{p_Y(1)}{p_Y(0)}\right) + \ln\!\left(\dfrac{p_{X \mid Y = 1}(x)}{p_{X \mid Y = 0}(x)}\right).$

In our toy model, we have $Y \sim \mathrm{Bern}(\pi)$ and $X \mid Y = y \sim \mathrm{N}(\mu_y, A A^T)$, so:
$\text{log-odds}(x) = \ln\dfrac{\pi}{1 - \pi} + \ln\dfrac{e^{-\frac{1}{2}\|A^{-1}(x - \mu_1)\|_2^2}}{e^{-\frac{1}{2}\|A^{-1}(x - \mu_0)\|_2^2}}$
$= \ln\dfrac{\pi}{1 - \pi} - \dfrac{1}{2}\|A^{-1}(x - \mu_1)\|_2^2 + \dfrac{1}{2}\|A^{-1}(x - \mu_0)\|_2^2$
$= \underbrace{\ln\dfrac{\pi}{1 - \pi} - \dfrac{1}{2}\bigl(\|A^{-1}\mu_1\|_2^2 - \|A^{-1}\mu_0\|_2^2\bigr)}_{\text{constant, doesn't depend on } x} + \underbrace{(\mu_1 - \mu_0)^T (A A^T)^{-1} x}_{\text{linear function of } x}.$

This is an affine function of $x$. Hence, the statistical model for $Y \mid X$ is a logistic regression model (with affine feature expansion).

Important: the logistic regression model forgets about $p_{X \mid Y = y}$!
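The derivation above maps directly to code. Here is a minimal sketch that, given the generative parameters $(\pi, \mu_0, \mu_1, \Sigma)$, computes the weight vector and intercept of the induced log-odds function $\text{log-odds}(x) = w^T x + b$; the example parameter values are made up.

import numpy as np

def gaussian_to_logistic(pi, mu0, mu1, Sigma):
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu0)                                   # linear part
    b = np.log(pi / (1 - pi)) - 0.5 * (mu1 @ Sigma_inv @ mu1      # constant part
                                       - mu0 @ Sigma_inv @ mu0)
    return w, b

w, b = gaussian_to_logistic(0.5, np.array([0.0, 0.0]), np.array([2.0, 0.0]), np.eye(2))
print(w, b)  # predict label 1 when w @ x + b > 0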

4. Linear classifiers

Linear classifiers

A linear classifier is specified by a weight vector $w \in \mathbb{R}^d$:
$f_w(x) := \begin{cases} -1 & \text{if } x^T w \le 0, \\ +1 & \text{if } x^T w > 0. \end{cases}$
(It will be notationally simpler to use $\{-1, +1\}$ instead of $\{0, 1\}$.)

Interpretation: does a linear combination of the input features exceed 0? For $w = (w_1, \ldots, w_d)$ and $x = (x_1, \ldots, x_d)$,
$x^T w = \sum_{i=1}^d w_i x_i \overset{?}{>} 0.$
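A minimal Python sketch of $f_w$, assuming numpy arrays for $x$ and $w$:

import numpy as np

def f_w(x, w):
    # Linear classifier: +1 if x^T w > 0, otherwise -1.
    return 1 if np.dot(x, w) > 0 else -1

print(f_w(np.array([1.0, -3.0]), np.array([2.0, 0.5])))  # -> 1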

Geometry of linear classifiers

[Figure: a hyperplane H in the $(x_1, x_2)$ plane with normal vector $w$.]

A hyperplane in $\mathbb{R}^d$ is a linear subspace of dimension $d - 1$.
- An $\mathbb{R}^2$-hyperplane is a line.
- An $\mathbb{R}^3$-hyperplane is a plane.
As a linear subspace, a hyperplane always contains the origin.

A hyperplane $H$ can be specified by a (non-zero) normal vector $w \in \mathbb{R}^d$. The hyperplane with normal vector $w$ is the set of points orthogonal to $w$:
$H = \{x \in \mathbb{R}^d : x^T w = 0\}.$

Classification with a hyperplane

[Figure: the hyperplane H, its normal vector w, the line span{w}, and a point x at angle θ from w.]

The projection of $x$ onto $\mathrm{span}\{w\}$ (a line) has coordinate $\|x\|_2 \cos\theta$, where
$\cos\theta = \dfrac{x^T w}{\|w\|_2 \|x\|_2}.$
(The distance to the hyperplane is $\|x\|_2 \cos\theta$.)

The decision boundary is the hyperplane (oriented by $w$):
$x^T w > 0 \iff \|x\|_2 \cos\theta > 0 \iff x$ is on the same side of $H$ as $w$.

What should we do if we want a hyperplane decision boundary that doesn't (necessarily) go through the origin?

Decision boundary with quadratic feature expansion

[Figures: an elliptical decision boundary and a hyperbolic decision boundary.]

The same feature expansions we saw for linear regression models can also be used here to upgrade linear classifiers.
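As a concrete sketch, here is one assumed quadratic feature map for $x \in \mathbb{R}^2$ (an affine term plus all degree-2 monomials); a linear classifier in the expanded space can produce elliptical or hyperbolic boundaries in the original space.

import numpy as np

def quadratic_expansion(x):
    # Map (x1, x2) to (1, x1, x2, x1^2, x2^2, x1*x2).
    x1, x2 = x
    return np.array([1.0, x1, x2, x1 * x1, x2 * x2, x1 * x2])

print(quadratic_expansion(np.array([0.5, -1.0])))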

Text features

How can we get features for text?

Classifying words (input = a sequence of characters):
- x_{starts with "anti"} = 1{starts with "anti"}
- x_{ends with "logy"} = 1{ends with "logy"}
- ... (same for all four-letter prefixes/suffixes)
- x_{length 3} = 1{length 3}
- x_{length 4} = 1{length 4}
- ... (same for all positive integers up to 20)

Classifying documents (input = a sequence of words):
- x_{contains "aardvark"} = 1{contains "aardvark"}
- ... (same for all words in the dictionary)
- x_{contains "each day"} = 1{contains "each day"}
- ... (same for all bi-grams of words in the dictionary)

Etc. We typically end up with lots of features!
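Below is a minimal sketch of the word-level features above as a Python dictionary. The particular prefixes and suffixes listed are assumed examples, and the length indicators are taken to be exact-length tests; the slides only say "length 3", "length 4", and so on.

def word_features(word, prefixes=("anti", "over"), suffixes=("logy", "tion")):
    feats = {}
    for p in prefixes:
        feats["starts_with_" + p] = int(word.startswith(p))
    for s in suffixes:
        feats["ends_with_" + s] = int(word.endswith(s))
    for k in range(1, 21):                      # lengths 1 through 20
        feats["length_%d" % k] = int(len(word) == k)
    return feats

print({k: v for k, v in word_features("anthropology").items() if v})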

Sparse representations

If $x$ has few non-zero entries, then it is better to use a sparse representation.

E.g., "see spot run". Sparse representation (e.g., using a hash table):
x = { "contains_see":1, "contains_spot":1, "contains_run":1, "contains_see_spot":1, "contains_spot_run":1 }

Compare to a dense representation (using an array):
x = [ 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, ... ]

To compute the inner product $w^T x$:
sum = 0
for feature in x:
    sum = sum + w[feature] * x[feature]
Running time ∝ number of non-zero entries of $x$.

If the $x$'s are sparse, would you expect there to be a good sparse linear classifier?
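A runnable version of the sparse inner-product loop above, assuming both the example x and the weight vector w are stored as dicts keyed by feature name:

def sparse_dot(w, x):
    total = 0.0
    for feature, value in x.items():
        total += w.get(feature, 0.0) * value   # missing weights count as zero
    return total

x = {"contains_see": 1, "contains_spot": 1, "contains_run": 1}
w = {"contains_spot": 0.7, "contains_run": -0.2}
print(sparse_dot(w, x))  # -> 0.5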

5. Learning linear classifiers

Learning linear classifiers

Even if the Bayes optimal classifier is not linear (in our chosen feature space), we can hope that it has a good linear approximation.

Goal: learn $\hat{w} \in \mathbb{R}^d$ using an iid sample $(X_1, Y_1), \ldots, (X_n, Y_n)$ such that
$\text{Excess risk}(f_{\hat{w}}) := \underbrace{R(f_{\hat{w}})}_{\text{risk of your classifier}} - \underbrace{\min_{w \in \mathbb{R}^d} R(f_w)}_{\text{risk of best linear classifier}}$
is as small as possible, where the loss function is zero-one loss.

Empirical risk minimization (ERM): find $f_w$ with minimum empirical risk
$\hat{R}(f_w) := \dfrac{1}{n} \sum_{i=1}^n \mathbb{1}\{f_w(X_i) \neq Y_i\}.$

Theorem: the ERM solution $f_{\hat{w}}$ satisfies
$\mathbb{E}\, R(f_{\hat{w}}) - \min_{w \in \mathbb{R}^d} R(f_w) \to 0$
as $n \to \infty$, at rate $\sqrt{d/n}$ (and sometimes even faster!).
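For concreteness, a minimal sketch of the empirical (zero-one) risk of a linear classifier $f_w$, assuming labels in $\{-1, +1\}$ and numpy arrays:

import numpy as np

def empirical_risk(w, X, y):
    # Fraction of training examples misclassified by f_w.
    predictions = np.where(X @ w > 0, 1, -1)
    return np.mean(predictions != y)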

ERM for zero-one loss

Unfortunately, we cannot compute the (zero-one loss) ERM in general. The following problem is NP-hard (i.e., "computationally intractable"):
- input: n labeled examples from $\mathbb{R}^d \times \{0, 1\}$ with the promise that there is a linear classifier with empirical risk at most 0.01.
- output: a linear classifier with empirical risk at most 0.49.
(Zero-one loss is very different from squared loss!)

Potential saving grace: real-world problems we need to solve do not look like the encodings of difficult 3-SAT instances.

6. Linearly separable data and Perceptron

Linearly separable data

Suppose there is a linear classifier that perfectly classifies the training examples $S := ((x_1, y_1), \ldots, (x_n, y_n))$, i.e., for some $w \in \mathbb{R}^d$,
$f_w(x) = y$ for all $(x, y) \in S$.
In this case, we say the training data is linearly separable.

[Figure: linearly separable data with a separating hyperplane and its normal vector $w$.]

Finding a linear separator

Problem: given training examples $S$ from $\mathbb{R}^d \times \{-1, +1\}$, determine whether or not there exists $w \in \mathbb{R}^d$ such that
$f_w(x) = y$ for all $(x, y) \in S$
(and find such a vector if one exists).

- $d$ variables: $w \in \mathbb{R}^d$.
- $|S|$ inequalities: for $(x, y) \in S$, if $y = -1$: $x^T w \le 0$; if $y = +1$: $x^T w > 0$.
This can be solved in polynomial time using algorithms for linear programming (see the sketch below).

If a separator exists, and the inequalities can be satisfied with some non-negligible "wiggle room", then there is a very simple algorithm that finds a solution.
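Here is a minimal sketch of the linear-programming route mentioned above, with scipy as an assumed dependency. It further assumes the data can be separated with wiggle room, so the strict inequalities can be replaced by $y_i x_i^T w \ge 1$; finding $w$ is then a pure feasibility problem (the objective is zero).

import numpy as np
from scipy.optimize import linprog

def find_separator(X, y):
    # Feasibility LP: find w with y_i * x_i^T w >= 1 for all i,
    # written as -y_i * x_i^T w <= -1 for linprog's A_ub form.
    n, d = X.shape
    A_ub = -(y[:, None] * X)
    b_ub = -np.ones(n)
    res = linprog(c=np.zeros(d), A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * d)
    return res.x if res.success else None

X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, -1.5]])
y = np.array([1, 1, -1])
print(find_separator(X, y))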

Perceptron (Rosenblatt, 1958)

Perceptron
input: labeled examples $S$ from $\mathbb{R}^d \times \{-1, +1\}$.
1: initialize ŵ_1 := 0.
2: for t = 1, 2, ... do
3:   if there is an example in S misclassified by f_{ŵ_t} then
4:     let (x_t, y_t) be any such misclassified example from S.
5:     update: ŵ_{t+1} := ŵ_t + y_t x_t.
6:   else
7:     return ŵ_t.
8:   end if
9: end for

Note 1: An example (x, y) is misclassified by f_w if y x^T w ≤ 0.
Note 2: If Perceptron terminates, then f_{ŵ_t} perfectly classifies the data!
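A minimal runnable sketch of this loop, assuming numpy arrays and linearly separable data (otherwise the while-loop below never exits, as discussed later in these slides):

import numpy as np

def perceptron(X, y):
    n, d = X.shape
    w = np.zeros(d)
    while True:
        mistakes = np.where(y * (X @ w) <= 0)[0]   # indices of misclassified examples
        if len(mistakes) == 0:
            return w                               # perfect classification of S
        t = mistakes[0]                            # any misclassified example will do
        w = w + y[t] * X[t]

X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, -1.5]])
y = np.array([1, 1, -1])
print(perceptron(X, y))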

Perceptron demo

Scenario 1: the current vector ŵ_t is comparable to x_t in length, and the example (x_t, y_t) with y_t = 1 is misclassified (x_t^T ŵ_t ≤ 0). The updated vector ŵ_{t+1} now correctly classifies (x_t, y_t).

Scenario 2: the current vector ŵ_t is much longer than x_t, and the example (x_t, y_t) with y_t = 1 is misclassified (x_t^T ŵ_t ≤ 0). The updated vector ŵ_{t+1} still does not correctly classify (x_t, y_t).

[Figures: the vectors x_t, ŵ_t, and ŵ_{t+1} in each scenario.]

It is not obvious that Perceptron will eventually terminate!

When is there a lot of wiggle room?

Suppose $w \in \mathbb{R}^d$ satisfies
$\min_{(x, y) \in S} y x^T w > 0.$
Then so does, e.g., $w / 100$. Let's fix a particular scaling of $w$.

Let $(\tilde{x}, \tilde{y})$ be any example in $S$ that achieves the minimum. Rescale $w$ so that $\tilde{y} \tilde{x}^T w = 1$. (Now the scaling is fixed.)

[Figure: the hyperplane $H$, the point $\tilde{y}\tilde{x}$, and its distance $\tilde{y}\tilde{x}^T w / \|w\|_2$ to $H$.]

Now the distance from $\tilde{y}\tilde{x}$ to $H$ is $1 / \|w\|_2$. This distance is called the margin.

Therefore, if $w$ is a short vector satisfying $\min_{(x, y) \in S} y x^T w = 1$, then it corresponds to a linear separator with large margin on the examples in $S$.

Theorem: Perceptron converges quickly when there is a short $w$ with $\min_{(x, y) \in S} y x^T w = 1$.

Perceptron convergence theorem

Theorem. Assume there exists $w \in \mathbb{R}^d$ such that $\min_{(x, y) \in S} y x^T w = 1$. Let $L := \max_{(x, y) \in S} \|x\|_2$. Then Perceptron halts after at most $\|w\|_2^2 L^2$ iterations.

Proof: Suppose Perceptron does not exit the for-loop in iteration $t$. Then there is a labeled example in $S$, which we call $(x_t, y_t)$, such that:
- $y_t w^T x_t \ge 1$ (by definition of $w$),
- $y_t \hat{w}_t^T x_t \le 0$ (since $f_{\hat{w}_t}$ misclassifies the example),
- $\hat{w}_{t+1} = \hat{w}_t + y_t x_t$ (since an update is made).

We use this information to (inductively) lower-bound
$\cos(\text{angle between } w \text{ and } \hat{w}_{t+1}) = \dfrac{w^T \hat{w}_{t+1}}{\|w\|_2 \|\hat{w}_{t+1}\|_2}.$

Proof of Perceptron convergence theorem

Goal: lower-bound $\dfrac{w^T \hat{w}_{t+1}}{\|w\|_2 \|\hat{w}_{t+1}\|_2}$.

Numerator:
$w^T \hat{w}_{t+1} = w^T(\hat{w}_t + y_t x_t) = w^T \hat{w}_t + y_t w^T x_t \ge w^T \hat{w}_t + 1.$
Therefore, by induction (with $\hat{w}_1 = 0$), $w^T \hat{w}_{t+1} \ge t$.

Denominator:
$\|\hat{w}_{t+1}\|_2^2 = \|\hat{w}_t + y_t x_t\|_2^2 = \|\hat{w}_t\|_2^2 + 2 y_t \hat{w}_t^T x_t + \|x_t\|_2^2 \le \|\hat{w}_t\|_2^2 + L^2.$
Therefore, by induction (with $\hat{w}_1 = 0$), $\|\hat{w}_{t+1}\|_2^2 \le L^2 t$.

Hence
$\cos(\text{angle between } w \text{ and } \hat{w}_{t+1}) = \dfrac{w^T \hat{w}_{t+1}}{\|w\|_2 \|\hat{w}_{t+1}\|_2} \ge \dfrac{t}{\|w\|_2 L \sqrt{t}}.$
Since cosine is at most one, we conclude $t \le \|w\|_2^2 L^2$.

Sneak peek: maximum margin linear classifier

We can obtain the linear separator with the largest margin by finding the shortest vector $w$ such that
$\min_{(x, y) \in S} y x^T w = 1.$

[Figure: the largest-margin separator compared with a separator of smaller margin.]

This can be expressed as a mathematical optimization problem that is the basis of support vector machines. (More on this later.)

7. Online Perceptron

Online Perceptron

If the data is not linearly separable, Perceptron runs forever! Alternative: consider each example once, then stop.

Online Perceptron
input: labeled examples $((x_i, y_i))_{i=1}^n$ from $\mathbb{R}^d \times \{-1, +1\}$.
1: initialize ŵ_1 := 0.
2: for t = 1, 2, ..., n do
3:   if y_t x_t^T ŵ_t ≤ 0 then
4:     ŵ_{t+1} := ŵ_t + y_t x_t.
5:   else
6:     ŵ_{t+1} := ŵ_t.
7:   end if
8: end for
9: return ŵ_{n+1}.

The final classifier f_{ŵ_{n+1}} is not necessarily a linear separator (even if one exists!).
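A minimal sketch of Online Perceptron in Python. It returns every intermediate weight vector ŵ_1, ..., ŵ_{n+1}, since the online-to-batch conversions discussed below combine them; the optional starting vector is an assumption used by the multi-pass variant sketched later.

import numpy as np

def online_perceptron(X, y, w_init=None):
    n, d = X.shape
    w = np.zeros(d) if w_init is None else w_init.copy()
    ws = [w.copy()]                          # w_1
    for t in range(n):
        if y[t] * (X[t] @ w) <= 0:           # mistake: update
            w = w + y[t] * X[t]
        ws.append(w.copy())                  # w_{t+1}
    return ws                                # ws[-1] is w_{n+1}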

Online learning

Online learning algorithms:
- Go through the examples (x_1, y_1), (x_2, y_2), ... one-by-one.
- Before seeing (x_t, y_t), the learner has a current classifier f̂_t in hand.
- Upon seeing x_t, the learner makes a prediction: ŷ_t := f̂_t(x_t).
- If ŷ_t ≠ y_t, then the prediction was a mistake.
- The learner then sees the correct label y_t and updates f̂_t to get f̂_{t+1}. Typically, the update is very cheap to compute.

Theorem: If there is a short $w$ with $\min_t y_t x_t^T w \ge 1$, then Online Perceptron makes a small number of mistakes.

Theorem: If there is a short $w$ that makes a small number of "hinge loss mistakes", then Online Perceptron makes a small number of mistakes. (See the reading assignment; we'll discuss hinge loss in an upcoming lecture.)

What good is a small mistake bound?

The sequence of classifiers f̂_1, f̂_2, ... is accurate (on average) in predicting the labels of the iid sequence (X_1, Y_1), (X_2, Y_2), ...

[Figure: each classifier f̂_t predicts ŷ_t = f̂_t(X_t), which is then compared with Y_t, for t = 1, ..., 10.]

We can combine the classifiers f̂_1, f̂_2, ... into a single, accurate classifier f̂. This is achieved via online-to-batch conversion.

Online-to-batch conversion

Run the online learning algorithm on the sequence of examples (x_1, y_1), ..., (x_n, y_n) (in random order) to produce a sequence of binary classifiers f̂_1, ..., f̂_{n+1} (each f̂_i : X → {±1}).

Final classifier: majority vote over the f̂_i's,
$\hat{f}(x) := \begin{cases} -1 & \text{if } \sum_{i=1}^{n+1} \hat{f}_i(x) \le 0, \\ +1 & \text{if } \sum_{i=1}^{n+1} \hat{f}_i(x) > 0. \end{cases}$

There are many variants of online-to-batch conversions that make sense!

Practical suggestions for online-to-batch variant

Run the online learning algorithm to make a few (e.g., two) passes over the training examples, each time in a different random order.
Pass 1: (x_2, y_2), (x_9, y_9), (x_6, y_6), (x_1, y_1), (x_3, y_3), (x_4, y_4), (x_10, y_10), (x_8, y_8), (x_7, y_7), (x_5, y_5)
Pass 2: (x_4, y_4), (x_5, y_5), (x_1, y_1), (x_2, y_2), (x_3, y_3), (x_10, y_10), (x_7, y_7), (x_9, y_9), (x_6, y_6), (x_8, y_8)
(Total of 2n rounds.)

For linear classifiers, instead of using majority vote to combine, just average the weight vectors:
$\hat{w} := \dfrac{1}{2n+1} \sum_{i=1}^{2n+1} \hat{w}_i.$

Don't use the weight vectors from the first pass through the data:
$\hat{w} := \dfrac{1}{n+1} \sum_{i=n+1}^{2n+1} \hat{w}_i.$
(The weight vectors at the start are rubbish anyway.)
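A minimal sketch of this two-pass averaged variant, reusing the online_perceptron sketch from earlier; it averages only the weight vectors produced during the second pass. The random seed is an assumption for reproducibility.

import numpy as np

def averaged_perceptron_two_pass(X, y, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    second_pass_ws = []
    for p in range(2):
        order = rng.permutation(n)                 # fresh random order each pass
        ws = online_perceptron(X[order], y[order], w_init=w)
        w = ws[-1]                                 # continue from the last weight vector
        if p == 1:
            second_pass_ws = ws                    # w_{n+1}, ..., w_{2n+1}
    return np.mean(second_pass_ws, axis=0)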

Example: restaurant review classification

Data: 1M reviews of Pittsburgh restaurants (from Yelp).
- Y = {at least 4 stars, below 4 stars} (66.2% have at least 4 (out of 5) stars).
- Reviews are represented as bags-of-words: e.g., x_great = # times "great" appears in the review. (d ≈ 200000 unique words.)

Results for Online Perceptron + affine expansion + online-to-batch variant:
- Training risk: 10.1%.
- Test risk (on 320123 held-out test examples): 10.5%.
- Running time: 1.5 minutes on a single Intel Xeon 2.30GHz CPU. (This omits the time to load the data, which was another 4 minutes from the raw text.)

Other approaches for learning linear (or affine) classifiers

- Naïve Bayes generative model MLE
- Gaussian generative model MLE (sometimes)
- Methods for linear regression (+ thresholding)
- Logistic regression, linear programming, Perceptron, Online Perceptron
- Support vector machines, etc. (Next lecture.)

Key takeaways

1. Logistic regression model.
2. Structure/geometry of linear classifiers.
3. Empirical risk minimization for linear classifiers and intractability.
4. Linear separability; two approaches to find a linear separator.
5. Online Perceptron; online-to-batch conversion.