Lecture 4: Linear predictors and the Perceptron


1 Lecture 4: Linear predictors and the Perceptron. Introduction to Learning and Analysis of Big Data. Kontorovich and Sabato (BGU).

2 Inductive Bias. Inductive bias is critical to prevent overfitting; here, the inductive bias is a relatively simple hypothesis class H. What if we don't know which H is suitable for our learning problem? Choose a good representation (relevant features) and use a general-purpose hypothesis class. One of the most popular choices: linear predictors.

3 Linear predictors. Recall that in many learning problems X = R^d: each example is a vector with d coordinates (features). In binary classification problems, Y consists of two labels. In linear prediction, H is the class of all linear separators. (Illustration with d = 2 on the slide.)

4 Why restrict to linear predictors? If the algorithm could choose any separating boundary, we could get overfitting: labels of unseen examples would not be predicted correctly, e.g. if using a squiggly line.

5 Preventing overfitting. Linear predictors prevent overfitting: in dimension d, if the training sample size is Θ(d), then training error ≈ true prediction error. (Dimension d = number of features.) So, if we use linear predictors with enough training samples, does this guarantee low prediction error? Recall the No-Free-Lunch theorem...

6 Preventing overfitting is not enough. No overfitting: error on the training sample ≈ true prediction error on the distribution. But we can still have high error on the training sample.

7 Preventing overfitting is not enough. Recall the decomposition of the prediction error err(ĥ_S, D):
Approximation error: err_app := inf_{h∈H} err(h, D).
Estimation error: err_est := err(ĥ_S, D) − inf_{h∈H} err(h, D).
No overfitting ⇒ the estimation error is low. But we also need a low approximation error: if all linear predictors have high error, no sample size will help. In practice, linear predictors often do have a low approximation error, provided good features are chosen to represent the examples!

8 Formalizing linear predictors. To formalize linear predictors we will use inner products. Definition: for vectors x, z ∈ R^d, ⟨x, z⟩ := Σ_{i=1}^d x(i)·z(i). The length of a vector x ∈ R^d can be defined via its inner product: ‖x‖ = √(Σ_{i=1}^d x(i)²) = √⟨x, x⟩. The angle between two vectors is defined by their inner product: cos(θ) = ⟨x, z⟩ / (‖x‖·‖z‖). (Large value = small angle.) Inner products are commutative: ⟨x, z⟩ = ⟨z, x⟩. Inner products are linear: if a ∈ R and x, x′, z ∈ R^d, then ⟨a·x, z⟩ = a·⟨x, z⟩ and ⟨x + x′, z⟩ = ⟨x, z⟩ + ⟨x′, z⟩.
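To make these definitions concrete, here is a minimal numpy sketch (numpy is an assumption; the lecture itself is language-agnostic) computing an inner product, a norm, and the cosine of the angle between two vectors:

import numpy as np

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

inner = np.dot(x, z)                     # <x, z> = sum_i x(i) z(i)
norm_x = np.sqrt(np.dot(x, x))           # ||x|| = sqrt(<x, x>)
cos_theta = inner / (np.linalg.norm(x) * np.linalg.norm(z))   # cosine of the angle

print(inner, norm_x, cos_theta)          # 1.0, ~2.236, ~0.141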

9 Formalizing linear predictors. Call the labels Y = {−1, +1}. In 2 dimensions the separating line is a·x(1) + b = x(2), and the linear prediction rule is: y = +1 if a·x(1) + b ≥ x(2), and y = −1 if a·x(1) + b < x(2).

10 Formalizing linear predictors. In two dimensions: y = +1 if a·x(1) + b ≥ x(2), and y = −1 if a·x(1) + b < x(2). Define a vector w = (w(1), w(2)) = (a, −1). Then we can rewrite this as y = sign(w(1)·x(1) + w(2)·x(2) + b) = sign(⟨w, x⟩ + b). For a vector w ∈ R^d and a number b ∈ R, define the linear predictor h_{w,b}: for x ∈ R^d, h_{w,b}(x) := sign(⟨w, x⟩ + b); b is called the bias of the predictor. The hypothesis class of all linear predictors in dimension d: H^d_L := {h_{w,b} : w ∈ R^d, b ∈ R}.
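A minimal sketch of the predictor h_{w,b} (the function name h and mapping sign(0) to +1 are my choices; the slides leave the boundary case unspecified):

import numpy as np

def h(w, b, x):
    """Linear predictor h_{w,b}(x) = sign(<w, x> + b); here sign(0) is taken as +1."""
    return 1 if np.dot(w, x) + b >= 0 else -1

# Two-dimensional example with w = (a, -1): predicts +1 exactly when a*x(1) + b >= x(2).
a, b = 2.0, 0.5
print(h(np.array([a, -1.0]), b, np.array([1.0, 1.0])))   # +1, since 2*1 + 0.5 >= 1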

11 Formalizing linear predictors. h_{w,b}(x) := sign(⟨w, x⟩ + b), H^d_L := {h_{w,b} : w ∈ R^d, b ∈ R}. In 3 dimensions, the linear boundary ⟨w, x⟩ + b = 0 is a plane; in higher dimensions, it is a hyperplane. The vector w is the normal to the hyperplane, and |b|/‖w‖ is the distance from the origin to the hyperplane.

12 The bias b is not needed. Suppose we have a classification problem with X = R^d. For every example x ∈ R^d, define x′ ∈ R^{d+1} by x = (x(1), ..., x(d)) ↦ x′ := (x(1), ..., x(d), 1). For every linear predictor with a bias, h_{w,b} on R^d, define a linear predictor h_{w′} without a bias on R^{d+1}, where w′ := (w(1), ..., w(d), b). We get h_{w,b}(x) = h_{w′}(x′) for all x, w, b: h_{w′}(x′) = sign(⟨x′, w′⟩) = sign(⟨x, w⟩ + b) = h_{w,b}(x). Conclusion: by adding a coordinate which is always 1, we can discard the bias term.
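A small sketch of this reduction (the helper names are mine): appending a constant-1 feature to each example and folding b into the last coordinate of w gives the same predictions.

import numpy as np

def add_constant_feature(X):
    """Map each row x in R^d to x' = (x(1), ..., x(d), 1) in R^{d+1}."""
    return np.hstack([X, np.ones((X.shape[0], 1))])

def fold_bias(w, b):
    """Map (w, b) to w' = (w(1), ..., w(d), b), so that <w', x'> = <w, x> + b."""
    return np.append(w, b)

X = np.array([[1.0, 2.0], [0.5, -1.0]])
w, b = np.array([2.0, -1.0]), 0.5
assert np.allclose(add_constant_feature(X) @ fold_bias(w, b), X @ w + b)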

13 Removing the bias term. From one dimension with a bias to two dimensions without a bias (illustration on the slide). Linear predictors without a bias are called homogeneous. Linear predictors are also called halfspaces.

14 Implementing the ERM for linear predictors. Implementing ERM: find a linear predictor h_w with minimal empirical error, err(h_w, S) = (1/m)·|{i : sign(⟨x_i, w⟩) ≠ y_i}|. This problem is NP-hard. There are workarounds (later in the course). Today: an efficient algorithm if the problem is realizable. Definition: D is realizable by H if there exists some h* ∈ H such that err(h*, D) = 0.

15 ERM in the realizable case. Definition: D is realizable by H if there exists some h* ∈ H such that err(h*, D) = 0. Then for any x_i in the training sample, y_i = h*(x_i). So min_{h∈H} err(h, S) ≤ err(h*, S) = 0. ERM in the realizable case: find some h ∈ H such that err(h, S) = 0. For linear predictors: find an h_{w,b} that separates the positive and negative labels in the training sample. This can be done efficiently; we will see two efficient methods. For linear predictors: realizable = separable.

16 ERM for separable linear predictors: Linear Programming. A linear program (LP) is a problem of the following form: maximize ⟨u, w⟩ over w ∈ R^d, subject to Aw ≥ v. Here w ∈ R^d is the vector we wish to find, and u ∈ R^d, v ∈ R^m, A ∈ R^{m×d}; the values of u, v, A define the specific linear program. LPs can be solved efficiently and many solvers are available. In Matlab: w = linprog(-u, -A, -v).

17 ERM for separable linear predictors: Linear Programming. Linear program: maximize ⟨u, w⟩ over w ∈ R^d, subject to Aw ≥ v. ERM for the separable case: find a linear predictor with zero error on the training sample {(x_i, y_i)}_{i≤m}. Recall y_i ∈ {−1, +1}. Our goal: find w ∈ R^d s.t. ∀i ≤ m, sign(⟨x_i, w⟩) = y_i. This is equivalent to: find w ∈ R^d s.t. ∀i ≤ m, y_i·⟨x_i, w⟩ > 0. Problem: in the linear program we have a weak inequality ≥, not a strict >. If we use y_i·⟨x_i, w⟩ ≥ 0 here, w = 0 satisfies the constraints.

18 ERM for separable linear predictors: Linear Programming. Linear program: maximize ⟨u, w⟩ over w ∈ R^d, subject to Aw ≥ v. Our goal: find w ∈ R^d s.t. ∀i ≤ m, y_i·⟨x_i, w⟩ > 0. We need to change the strict inequality to a weak one. If the problem is separable, there exists a solution; name one of the solutions w*. Denote γ := min_i y_i·⟨x_i, w*⟩ and note γ > 0. Define w̄ = w*/γ. Then for all i ≤ m, y_i·⟨x_i, w̄⟩ = y_i·⟨x_i, w*⟩/γ ≥ 1. Conclusion: there is a predictor w ∈ R^d such that ∀i ≤ m, y_i·⟨x_i, w⟩ ≥ 1. Also, any predictor that satisfies this has zero error on the sample, so it is a good solution.

19 ERM for separable linear predictors: Linear Programming. Linear program: maximize ⟨u, w⟩ over w ∈ R^d, subject to Aw ≥ v. Our goal can be re-written as: find w ∈ R^d s.t. ∀i ≤ m, y_i·⟨x_i, w⟩ ≥ 1. Turn this into the form of a linear program: u = (0, ..., 0) (nothing is maximized), v = (1, ..., 1), and row i of the matrix A is y_i·x_i = (y_i·x_i(1), ..., y_i·x_i(d)). The resulting linear program: maximize 0 over w ∈ R^d, subject to A·w ≥ (1, ..., 1), where the rows of A are y_1·x_1, ..., y_m·x_m.
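A sketch of this construction using scipy.optimize.linprog (scipy and the function name erm_separable_lp are assumptions, not part of the lecture). scipy minimizes c·w subject to A_ub·w ≤ b_ub, so the slide's "maximize 0 subject to Aw ≥ v" is passed with both sides negated, mirroring the Matlab call linprog(-u, -A, -v):

import numpy as np
from scipy.optimize import linprog

def erm_separable_lp(X, y):
    """Find w with y_i <x_i, w> >= 1 for all i, assuming S is separable."""
    m, d = X.shape
    A = y[:, None] * X                         # row i of A is y_i x_i
    res = linprog(c=np.zeros(d),               # u = 0: nothing is maximized
                  A_ub=-A, b_ub=-np.ones(m),   # -A w <= -v  <=>  A w >= v = (1, ..., 1)
                  bounds=[(None, None)] * d)   # w may have negative coordinates
    if not res.success:
        raise ValueError("no feasible w found; sample may not be separable")
    return res.x

# Tiny separable sample (constant-1 feature already appended, so no bias term).
X = np.array([[2.0, 1.0, 1.0], [1.0, 2.0, 1.0], [-1.0, -1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0])
w = erm_separable_lp(X, y)
assert np.all(y * (X @ w) >= 1 - 1e-6)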

20 ERM for separable linear predictors: Linear Programming. The LP approach is very easy to implement, but it can be slow, and it fails completely if there is even one bad label.

21 The Perceptron. The Perceptron algorithm was invented in 1958 by Rosenblatt. This version is called the Batch Perceptron. The goal remains as before: find a linear predictor with zero error on the training sample {(x_i, y_i)}_{i≤m}. Perceptron idea: work in rounds; start with a default predictor; in each round, look at a single training example; if the current predictor is wrong on this example, move the predictor in the right direction; stop when the predictor assigns the correct label to every example.

22 The Perceptron.
Batch Perceptron
input: a training sample S = ((x_1, y_1), ..., (x_m, y_m))
output: w ∈ R^d such that ∀i ≤ m, h_w(x_i) = y_i
1: w^(1) ← (0, ..., 0), t ← 1
2: while ∃i s.t. y_i·⟨w^(t), x_i⟩ ≤ 0 do
3:   w^(t+1) ← w^(t) + y_i·x_i
4:   t ← t + 1
5: end while
6: return w^(t)
Why does the update rule make sense?
y_i·⟨w^(t+1), x_i⟩ = y_i·⟨w^(t) + y_i·x_i, x_i⟩ = y_i·⟨w^(t), x_i⟩ + y_i²·⟨x_i, x_i⟩ = y_i·⟨w^(t), x_i⟩ + ‖x_i‖².
Each update moves y_i·⟨w^(t), x_i⟩ closer to being positive.
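A runnable sketch of the Batch Perceptron (numpy and the safety cap max_updates are my additions; the cap only guards against being handed a non-separable sample):

import numpy as np

def batch_perceptron(X, y, max_updates=1_000_000):
    """Batch Perceptron: returns w with y_i <w, x_i> > 0 for all i on a separable sample.

    X: (m, d) array of examples (append a constant-1 coordinate first if a bias is needed).
    y: (m,) array of labels in {-1, +1}.
    """
    m, d = X.shape
    w = np.zeros(d)                             # w^(1) = (0, ..., 0)
    for _ in range(max_updates):
        violated = np.where(y * (X @ w) <= 0)[0]
        if violated.size == 0:                  # all examples correctly classified: stop
            return w
        i = violated[0]                         # any i with y_i <w, x_i> <= 0
        w = w + y[i] * X[i]                     # update: w^(t+1) = w^(t) + y_i x_i
    raise RuntimeError("update budget exhausted; sample may not be separable")

# Example: the same tiny separable sample as in the LP sketch above.
X = np.array([[2.0, 1.0, 1.0], [1.0, 2.0, 1.0], [-1.0, -1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0])
w = batch_perceptron(X, y)
assert np.all(y * (X @ w) > 0)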

23 The Perceptron. Illustration in two-dimensional space (d = 2) on the board: the separator tilts in the right direction in each update, and the same example can be repeated several times. Does this always work? How many updates does it take to get an error-free separator?

24 The separation margin. Intuitively, separation is easier if the positive and negative points are far apart: far apart ⇒ there is a separator which is far from all points. We will show that the Perceptron is indeed faster in this case. First, let's make this formal. Claim: |⟨w, x⟩| / ‖w‖ is the distance between x ∈ R^d and the separator defined by w.

25 The separation margin. Claim: the distance between x and the hyperplane defined by w is |⟨w, x⟩| / ‖w‖.
Define w̄ := w/‖w‖. Then ‖w̄‖ = √⟨w̄, w̄⟩ = 1. The hyperplane is H = {v : ⟨v, w⟩ = 0} = {v : ⟨v, w̄⟩ = 0}. The distance between the hyperplane and x is min_{v∈H} ‖x − v‖.
Take v = x − ⟨w̄, x⟩·w̄. Then v ∈ H, because ⟨v, w̄⟩ = ⟨x, w̄⟩ − ⟨⟨w̄, x⟩·w̄, w̄⟩ = ⟨x, w̄⟩ − ⟨w̄, x⟩·⟨w̄, w̄⟩ = 0. So the distance is at most ‖x − v‖ = ‖⟨w̄, x⟩·w̄‖ = |⟨w̄, x⟩|·‖w̄‖ = |⟨w̄, x⟩| = |⟨w, x⟩| / ‖w‖.
Also, for any u ∈ H: ‖x − u‖² = ‖(x − v) + (v − u)‖² = ⟨(x − v) + (v − u), (x − v) + (v − u)⟩ = ‖x − v‖² + 2⟨x − v, v − u⟩ + ‖v − u‖², and ⟨x − v, v − u⟩ = ⟨⟨w̄, x⟩·w̄, v − u⟩ = ⟨w̄, x⟩·⟨w̄, v − u⟩ = 0, since both v and u lie in H. Hence ‖x − u‖ ≥ ‖x − v‖, so the distance is exactly ‖x − v‖ = |⟨w, x⟩| / ‖w‖.
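A quick numerical sanity check of the claim (the particular w and x below are arbitrary; this only illustrates the identity, it is not part of the proof):

import numpy as np

w = np.array([3.0, 4.0])
x = np.array([2.0, 1.0])

w_bar = w / np.linalg.norm(w)            # unit normal to the hyperplane {v : <v, w> = 0}
v = x - np.dot(w_bar, x) * w_bar         # the minimizer used in the proof
assert abs(np.dot(v, w)) < 1e-12         # v indeed lies on the hyperplane

dist_by_projection = np.linalg.norm(x - v)
dist_by_formula = abs(np.dot(w, x)) / np.linalg.norm(w)
assert np.isclose(dist_by_projection, dist_by_formula)   # both equal 2.0 here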

26 The separation margin. Denote R := max_i ‖x_i‖; we will normalize by this value. The minimal normalized distance of any x_i in S from the separator w is called the margin of w: γ(w) := (1/R) · min_{i≤m} y_i·⟨w, x_i⟩ / ‖w‖. (For a w that labels S correctly, y_i·⟨w, x_i⟩ = |⟨w, x_i⟩|, so this is the normalized distance of the closest point.)
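A one-function sketch of this definition (the function name is mine; it assumes w labels S correctly, so that y_i·⟨w, x_i⟩ is nonnegative):

import numpy as np

def margin(w, X, y):
    """gamma(w) = (1/R) * min_i y_i <w, x_i> / ||w||, with R = max_i ||x_i||."""
    R = np.max(np.linalg.norm(X, axis=1))
    return np.min(y * (X @ w)) / (R * np.linalg.norm(w))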

27 The separation margin. Which separators have a large margin? For any α > 0 and w ∈ R^d, α·w defines the same separator, with the same margin. So we can look at separators w such that min_{i≤m} y_i·⟨w, x_i⟩ = 1; this does not lose generality. Then (with R := max_i ‖x_i‖): γ(w) = 1 / (R·‖w‖). Small norm ‖w‖ and small R = large margin γ(w).

28 The Perceptron: Guarantee. Theorem (Theorem 9.1 in the course book). Assume that S = ((x_1, y_1), ..., (x_m, y_m)) is separable. Then: 1. When the Perceptron stops and returns w^(t), ∀i ≤ m, y_i·⟨w^(t), x_i⟩ > 0. 2. Define R := max_{i∈[m]} ‖x_i‖ and B := min{‖w‖ : ∀i ≤ m, y_i·⟨w, x_i⟩ ≥ 1}. The Perceptron performs at most (RB)² updates. Part (1) is trivial: the Perceptron never stops unless this holds. Recall γ(w) = 1/(R·‖w‖) under the normalization min_i y_i·⟨w, x_i⟩ = 1. Let γ_S be the largest margin achievable on S. Then γ_S = 1/(RB), so the number of updates is at most 1/γ_S².

29 The Perceptron: Proving the theorem. We will show that if the Perceptron runs for at least T iterations, then T ≤ (RB)². So the total number of iterations is at most (RB)². Let w* be such that ∀i, y_i·⟨w*, x_i⟩ ≥ 1 and ‖w*‖ = B. We will keep track of two quantities: ‖w^(t)‖ and ⟨w*, w^(t)⟩. We will show that the norm grows slowly while the inner product grows fast. More precisely: ‖w^(t+1)‖ ≤ √t·R and ⟨w*, w^(t+1)⟩ ≥ t. Recall that a larger ⟨w*, w^(t+1)⟩ / (‖w*‖·‖w^(t+1)‖) means a smaller angle between w* and w^(t+1). Reminder (Cauchy-Schwarz inequality): for all u, v ∈ R^d, ⟨u, v⟩ ≤ ‖u‖·‖v‖. So we will conclude: T ≤ ⟨w*, w^(T+1)⟩ ≤ ‖w*‖·‖w^(T+1)‖ ≤ B·√T·R, hence T ≤ (RB)².

30 The Perceptron: Proving the theorem. Upper bounding the norm ‖w^(T+1)‖: in iteration t, let i be the example that was used to update w^(t). Recall the Perceptron update w^(t+1) ← w^(t) + y_i·x_i, and note that y_i·⟨w^(t), x_i⟩ ≤ 0 (otherwise no update is made). Also ‖w^(1)‖² = 0. So ‖w^(t+1)‖² = ‖w^(t) + y_i·x_i‖² = ‖w^(t)‖² + 2·y_i·⟨w^(t), x_i⟩ + y_i²·‖x_i‖² ≤ ‖w^(t)‖² + R². By induction, ‖w^(T+1)‖² ≤ T·R², so ‖w^(T+1)‖ ≤ √T·R.

31 The Perceptron: Proving the theorem. Lower bounding the inner product ⟨w*, w^(T+1)⟩: since w^(1) = (0, ..., 0), we have ⟨w*, w^(1)⟩ = 0. Recall the Perceptron update w^(t+1) ← w^(t) + y_i·x_i. In each iteration ⟨w*, w^(t)⟩ increases by at least one: ⟨w*, w^(t+1)⟩ − ⟨w*, w^(t)⟩ = ⟨w*, w^(t+1) − w^(t)⟩ = ⟨w*, y_i·x_i⟩ = y_i·⟨w*, x_i⟩ ≥ 1 (from the definition of w*). After T iterations: ⟨w*, w^(T+1)⟩ = Σ_{t=1}^T (⟨w*, w^(t+1)⟩ − ⟨w*, w^(t)⟩) ≥ T. This means that our w gets closer in angle to w* at each iteration.

32 The Perceptron: Proving the theorem. We showed: ‖w^(T+1)‖ ≤ √T·R and ⟨w*, w^(T+1)⟩ ≥ T. Using Cauchy-Schwarz: T ≤ ⟨w*, w^(T+1)⟩ ≤ ‖w*‖·‖w^(T+1)‖ ≤ B·√T·R. Conclusion: T ≤ (RB)². So the Perceptron runs for at most (RB)² iterations, and when it stops, the separator it returns separates the examples in S. With γ_S := the best possible margin on S, we have 1/(RB) = γ_S, so the number of iterations is O(1/γ_S²).

33 Perceptron properties. The Perceptron processes one example at a time: low working memory. The number of updates depends on the margin; if the margin is very small, the Perceptron might take Ω(2^d) time to converge. In practice, in many natural problems the margin is large and the Perceptron is faster than LP. What if the training sample is not separable? LP will completely fail; the Perceptron can still run, but will not terminate on its own. There is no guarantee for the Perceptron in this case.

34 Linear predictors: Intermediate summary. Linear predictors are very popular because: if the sample size is Θ(d) (e.g. 10 times the dimension), the training error and the true error will probably be similar; and for many natural problems, there are linear predictors with low error. Computing the ERM for linear predictors is NP-hard in general, but in the realizable/separable case there are efficient algorithms: linear programming, and the Batch Perceptron algorithm if the margin is not too small. Next: what to do when the problem is not separable.
