The Perceptron Algorithm
1 The Perceptron Algorithm. Machine Learning, Spring 2018. The slides are mainly from Vivek Srikumar.
2 Outline The Perceptron Algorithm Perceptron Mistake Bound Variants of Perceptron 2
3 Where are we? The Perceptron Algorithm Perceptron Mistake Bound Variants of Perceptron 3
4 Recall: Linear Classifiers. Input is an n-dimensional vector x; output is a label y ∈ {-1, 1}. Linear Threshold Units (LTUs) classify an example x using the rule: Output = sgn(w^T x + b) = sgn(b + Σ_i w_i x_i). If w^T x + b ≥ 0, predict y = +1; if w^T x + b < 0, predict y = -1. b is called the bias term.
5 Recall: Linear Classifiers (continued). [Figure: a linear threshold unit combining inputs x_i through weights w_i and a bias b, then applying sgn.]
6 The geometry of a linear classifier. The line b + w_1 x_1 + w_2 x_2 = 0 separates the plane; the classifier outputs sgn(b + w_1 x_1 + w_2 x_2). We only care about the sign, not the magnitude. [Figure: the weight vector [w_1 w_2] drawn normal to the separating line in the (x_1, x_2) plane.] In n dimensions, a linear classifier represents a hyperplane that separates the space into two half-spaces.
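The classification rule above is a one-liner in code. The following is a minimal sketch (the function name `predict` and the toy weights are ours, not from the slides; sgn(0) is treated as +1, matching the ≥ 0 rule):

```python
import numpy as np

def predict(w, b, x):
    """Linear threshold unit: sgn(w^T x + b), with sgn(0) treated as +1."""
    return 1 if w @ x + b >= 0 else -1

# Toy 2-D example: the separating line is x1 + x2 - 1 = 0
w = np.array([1.0, 1.0])
b = -1.0
print(predict(w, b, np.array([1.0, 1.0])))   # -> 1  (above the line)
print(predict(w, b, np.array([0.0, 0.0])))   # -> -1 (below the line)
```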
7 The Perceptron 7
8 The hype. The New Yorker, December 6, 1958, p. 44; The New York Times, July 1958. [Photo: the IBM 704 computer.]
9 The Perceptron algorithm (Rosenblatt, 1958). The goal is to find a separating hyperplane. For separable data, it is guaranteed to find one. It is an online algorithm: it processes one example at a time. Several variants exist (discussed briefly towards the end).
10 The Perceptron algorithm. Input: a sequence of training examples (x_1, y_1), (x_2, y_2), ..., where all x_i ∈ R^n and y_i ∈ {-1, 1}. Initialize w_0 = 0 ∈ R^n. For each training example (x_i, y_i): predict y' = sgn(w_t^T x_i); if y_i ≠ y', update w_{t+1} ← w_t + r (y_i x_i). Return the final weight vector.
11 Remember: prediction = sgn(w^T x). There is typically also a bias term (w^T x + b), but the bias may be treated as a constant feature and folded into w.
12 Footnote: for some algorithms it is mathematically easier to represent False as -1, and at other times as 0. For the Perceptron algorithm, treat -1 as false and +1 as true.
13 Mistake on a positive example: w_{t+1} ← w_t + r x_i. Mistake on a negative example: w_{t+1} ← w_t - r x_i.
14 r is the learning rate, a small positive number less than 1.
15 Update only on error: a mistake-driven algorithm.
16 This is the simplest version. We will see more robust versions at the end.
17 A mistake can be written compactly as y_i w_t^T x_i ≤ 0.
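The pseudocode on these slides can be sketched in a few lines of Python. This is our own minimal rendering of one online pass (sgn(0) is treated as +1 here; the function name and the toy data are assumptions, not from the slides):

```python
import numpy as np

def perceptron(examples, n, r=1.0):
    """One online pass of the Perceptron: on a mistake, w <- w + r * y * x."""
    w = np.zeros(n)
    for x, y in examples:
        y_hat = 1 if w @ x >= 0 else -1   # predict y' = sgn(w^T x)
        if y_hat != y:                    # mistake: y_i != y'
            w = w + r * y * x
    return w

# Toy sequence; the constant third feature plays the role of the bias
data = [(np.array([1.0, 1.0, 1.0]), 1),
        (np.array([0.0, 0.0, 1.0]), -1),
        (np.array([1.0, 0.0, 1.0]), -1),
        (np.array([0.0, 1.0, 1.0]), -1)]
w = perceptron(data, 3)
```

One pass need not separate the data; the batch version with epochs appears later under "variants".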
18 Intuition behind the update. Mistake on positive: w_{t+1} ← w_t + r x_i. Mistake on negative: w_{t+1} ← w_t - r x_i. Suppose we have made a mistake on a positive example: y = +1 but w_t^T x < 0. Call the new weight vector w_{t+1} = w_t + x (taking r = 1). The new dot product is w_{t+1}^T x = (w_t + x)^T x = w_t^T x + x^T x > w_t^T x. So for a positive example, the Perceptron update increases the score assigned to that input. Similar reasoning applies to negative examples.
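This intuition is easy to verify numerically. A toy check with r = 1 and made-up vectors (our own illustration):

```python
import numpy as np

# A positive example currently misclassified: w^T x < 0
w = np.array([-1.0, 0.5])
x = np.array([2.0, 1.0])
assert w @ x < 0                      # mistake on a positive example

w_new = w + x                         # Perceptron update with y = +1, r = 1
# The score rises by exactly x^T x = ||x||^2 > 0
assert np.isclose(w_new @ x, w @ x + x @ x)
assert w_new @ x > w @ x
```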
19 Geometry of the perceptron update, for a mistake on a positive example. Mistake on positive: w_{t+1} ← w_t + r x_i; mistake on negative: w_{t+1} ← w_t - r x_i. [Figure build across slides 19-25: predict with w_old; the example (x, +1) is misclassified; update w ← w + x, rotating the weight vector towards x; the "After" panel shows w_new with (x, +1) on the positive side.]
26 Geometry of the perceptron update, for a mistake on a negative example. [Figure build across slides 26-31: predict with w_old; the example (x, -1) is misclassified; update w ← w - x, rotating the weight vector away from x; the "After" panel shows w_new with (x, -1) on the negative side.]
32 Perceptron Learnability. Obviously the Perceptron cannot learn what it cannot represent: only linearly separable functions. Minsky and Papert (1969) wrote an influential book demonstrating the Perceptron's representational limitations: parity functions (e.g. XOR) can't be learned. But we already know that XOR is not linearly separable; the feature transformation trick can help.
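The feature transformation trick can be illustrated concretely for XOR. This is our own toy construction (the feature map and the weight vector are chosen by hand, not taken from the slides): adding the product feature x_1 x_2 makes XOR linearly separable.

```python
import numpy as np

def phi(x1, x2):
    """Map (x1, x2) to (x1, x2, x1*x2, 1); in this space XOR is separable."""
    return np.array([x1, x2, x1 * x2, 1.0])

# XOR truth table with labels in {-1, +1}
data = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), -1)]

# One separating weight vector in the transformed space, found by inspection
w = np.array([1.0, 1.0, -2.0, -0.5])
assert all((1 if w @ phi(*x) >= 0 else -1) == y for x, y in data)
```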
33 What you need to know The Perceptron algorithm The geometry of the update What can it represent 33
34 Where are we? The Perceptron Algorithm Perceptron Mistake Bound Variants of Perceptron 34
35 Convergence. Convergence theorem: if there exists a set of weights consistent with the data (i.e. the data is linearly separable), the perceptron algorithm will converge.
36 Cycling theorem: if the training data is not linearly separable, then the learning algorithm will eventually repeat the same set of weights and enter an infinite loop.
37 Margin The margin of a hyperplane for a dataset is the distance between the hyperplane and the data point nearest to it Margin with respect to this hyperplane 37
38 Margin. The margin of a hyperplane for a dataset is the distance between the hyperplane and the data point nearest to it. The margin of a dataset, γ, is the maximum margin attainable for that dataset over all weight vectors. [Figure: margin of the data.]
39 Mistake Bound Theorem [Novikoff 1962, Block 1962]. Let (x_1, y_1), (x_2, y_2), ..., (x_m, y_m) be a sequence of training examples such that for all i, the feature vector x_i ∈ R^n with ||x_i|| ≤ R, and the label y_i ∈ {-1, +1}. Suppose there exists a unit vector u ∈ R^n (i.e. ||u|| = 1) such that for some γ > 0 we have y_i (u^T x_i) ≥ γ for all i. Then the perceptron algorithm will make at most (R/γ)^2 mistakes on the training sequence.
42 We can always find such an R: just look for the farthest data point from the origin.
43 The data and u have a margin γ; importantly, the data is separable. γ is the complexity parameter that defines the separability of the data.
44 If u hadn't been a unit vector, we could scale γ in the mistake bound; this changes the final mistake bound to (||u|| R / γ)^2.
45 In words: for a binary classification dataset with n-dimensional inputs, if the data is separable, then the Perceptron algorithm will find a separating hyperplane after making a finite number of mistakes.
46 Proof (preliminaries). The algorithm: receive an input (x_i, y_i); if sgn(w_t^T x_i) ≠ y_i, update w_{t+1} ← w_t + y_i x_i. The setting: the initial weight vector w_0 is all zeros; the learning rate is 1 (it effectively scales the inputs but does not change the behavior); all training examples are contained in a ball of size R, i.e. ||x_i|| ≤ R; the training data is separable by margin γ using a unit vector u, i.e. y_i (u^T x_i) ≥ γ.
47 Proof (1/3). Claim: after t mistakes, u^T w_t ≥ tγ. Each mistake adds y_i x_i to the weight vector, and u^T (y_i x_i) = y_i (u^T x_i) ≥ γ because the data is separable by margin γ. Since w_0 = 0 (so u^T w_0 = 0), straightforward induction gives u^T w_t ≥ tγ.
50 Proof (2/3). Claim: after t mistakes, ||w_t||^2 ≤ tR^2. The weight is updated only when there is a mistake, that is, when y_i w_t^T x_i < 0. Then ||w_{t+1}||^2 = ||w_t + y_i x_i||^2 = ||w_t||^2 + 2 y_i w_t^T x_i + ||x_i||^2 ≤ ||w_t||^2 + R^2, using y_i w_t^T x_i < 0 and ||x_i|| ≤ R (by definition of R). Since w_0 = 0, straightforward induction gives ||w_t||^2 ≤ tR^2.
53 Proof (3/3). What we know: (1) after t mistakes, u^T w_t ≥ tγ; (2) after t mistakes, ||w_t||^2 ≤ tR^2. Now, u^T w_t = ||u|| ||w_t|| cos(angle between them); but ||u|| = 1 and cosine is at most 1, so u^T w_t ≤ ||w_t|| (the Cauchy-Schwarz inequality). Combining (1) and (2): tγ ≤ u^T w_t ≤ ||w_t|| ≤ √t R. Hence √t ≤ R/γ, and the number of mistakes satisfies t ≤ (R/γ)^2. This bounds the total number of mistakes!
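The chain of inequalities on these proof slides can be written out as a single display:

```latex
t\gamma \;\le\; u^{\top} w_t \;\le\; \lVert w_t \rVert \;\le\; \sqrt{t}\,R
\quad\Longrightarrow\quad
\sqrt{t} \;\le\; \frac{R}{\gamma}
\quad\Longrightarrow\quad
t \;\le\; \left(\frac{R}{\gamma}\right)^{2}
```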
60 Mistake Bound Theorem [Novikoff 1962, Block 1962], illustrated. [Figure build across slides 60-64: the data points lie inside a ball of radius R; a separating direction with margin γ; the number of mistakes is at most (R/γ)^2.]
65 The Perceptron mistake bound: the number of mistakes is at most (R/γ)^2. Exercises: How many mistakes will the Perceptron algorithm make for disjunctions with n attributes? What are R and γ? How many mistakes will it make for k-disjunctions with n attributes?
66 What you need to know What is the perceptron mistake bound? How to prove it 66
67 Where are we? The Perceptron Algorithm Perceptron Mistake Bound Variants of Perceptron 67
68 Practical use of the Perceptron algorithm 1. Using the Perceptron algorithm with a finite dataset 2. Margin Perceptron 3. Voting and Averaging 68
69 1. The standard algorithm. Given a training set D = {(x_i, y_i)}, x_i ∈ R^n, y_i ∈ {-1, 1}: 1. Initialize w = 0 ∈ R^n. 2. For epoch = 1 ... T: shuffle the data; for each training example (x_i, y_i) ∈ D: if y_i w^T x_i ≤ 0, update w ← w + r y_i x_i. 3. Return w. Prediction: sgn(w^T x).
70 T is a hyper-parameter of the algorithm. The condition y_i w^T x_i ≤ 0 is another way of writing that there is an error.
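The batch version above can be sketched directly in Python. A minimal rendering under our own naming (`perceptron_train`, the fixed `seed`, and the toy dataset are assumptions for illustration):

```python
import random
import numpy as np

def perceptron_train(D, n, T=20, r=1.0, seed=0):
    """Standard Perceptron: T epochs, shuffling before each pass,
    updating w <- w + r * y * x whenever y * w^T x <= 0 (an error)."""
    rng = random.Random(seed)
    w = np.zeros(n)
    D = list(D)
    for _ in range(T):
        rng.shuffle(D)
        for x, y in D:
            if y * (w @ x) <= 0:      # error (also fires on the boundary)
                w = w + r * y * x
    return w

# Linearly separable toy data; a constant 1 feature plays the role of the bias
D = [(np.array([2.0, 1.0, 1.0]), 1),
     (np.array([1.0, 3.0, 1.0]), 1),
     (np.array([-1.0, -1.0, 1.0]), -1),
     (np.array([-2.0, 0.5, 1.0]), -1)]
w = perceptron_train(D, 3)
assert all(np.sign(w @ x) == y for x, y in D)
```

Since this toy set is separable, the mistake bound guarantees the final pass is error-free for large enough T.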
71 2. Margin Perceptron. Which hyperplane is better, h_1 or h_2? [Figure across slides 71-73: two separating hyperplanes, one close to the data and one farther away.] The farther the hyperplane is from the data points, the smaller the chance of making a wrong prediction.
74 2. Margin Perceptron. The Perceptron makes updates only when the prediction is incorrect: y_i w^T x_i < 0. What if the prediction is close to being incorrect? Pick a positive η and update when y_i (w^T x_i) / ||w|| < η. This can generalize better, but we need to choose η. Why is this a good idea? It intentionally enforces a large margin.
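The modified update condition can be sketched as follows. This is our illustration, not code from the slides: η denotes the chosen margin parameter, and we guard against dividing by the zero norm of the initial weight vector (an assumption the slide does not spell out).

```python
import numpy as np

def margin_mistake(w, x, y, eta):
    """Margin Perceptron condition: update when y * w^T x / ||w|| < eta."""
    norm = np.linalg.norm(w)
    if norm == 0.0:
        return True                       # initial w = 0: always update
    return (y * (w @ x)) / norm < eta

w = np.array([3.0, 4.0])                  # ||w|| = 5
x = np.array([1.0, 0.5])
# Normalized score: y * w^T x / ||w|| = (3 + 2) / 5 = 1.0 (correct prediction)
assert margin_mistake(w, x, 1, eta=1.5)       # within the margin -> update
assert not margin_mistake(w, x, 1, eta=0.5)   # comfortably correct -> no update
```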
76 3. Voting and Averaging. What if the data is not linearly separable? Finding a hyperplane with the minimum number of mistakes is NP-hard.
77 Voted Perceptron. Given a training set D = {(x_i, y_i)}, x_i ∈ R^n, y_i ∈ {-1, 1}: 1. Initialize w_1 = 0 ∈ R^n, m = 1, C_1 = 0. 2. For epoch = 1 ... T: for each training example (x_i, y_i) ∈ D: if y_i w_m^T x_i ≤ 0, update w_{m+1} ← w_m + r y_i x_i, set m = m + 1 and C_m = 1; else C_m = C_m + 1. 3. Return (w_1, C_1), (w_2, C_2), ..., (w_k, C_k). Prediction: sgn(Σ_{i=1}^{k} C_i sgn(w_i^T x)).
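The bookkeeping above can be sketched in Python. This is our reconstruction of the slide's pseudocode (function names and the toy dataset are ours): each intermediate weight vector w_m votes with weight C_m, the number of examples it classified correctly before the next mistake.

```python
import numpy as np

def voted_perceptron(D, n, T=10, r=1.0):
    """Voted Perceptron: keep every intermediate weight vector w_m together
    with a count C_m of the examples it classified correctly."""
    W, C = [np.zeros(n)], [0]
    for _ in range(T):
        for x, y in D:
            if y * (W[-1] @ x) <= 0:      # mistake: start a new weight vector
                W.append(W[-1] + r * y * x)
                C.append(1)
            else:                          # current vector survives one more example
                C[-1] += 1
    return W, C

def voted_predict(W, C, x):
    """sgn( sum_m C_m * sgn(w_m^T x) ), treating sgn(0) as +1."""
    s = sum(c * (1 if w @ x >= 0 else -1) for w, c in zip(W, C))
    return 1 if s >= 0 else -1

# Separable toy set; the constant third feature plays the role of the bias
D = [(np.array([2.0, 1.0, 1.0]), 1),
     (np.array([1.0, 3.0, 1.0]), 1),
     (np.array([-1.0, -1.0, 1.0]), -1),
     (np.array([-2.0, 0.5, 1.0]), -1)]
W, C = voted_perceptron(D, 3)
```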
78 Averaged Perceptron. Given a training set D = {(x_i, y_i)}, x_i ∈ R^n, y_i ∈ {-1, 1}: 1. Initialize w = 0 ∈ R^n and a = 0 ∈ R^n. 2. For epoch = 1 ... T: for each training example (x_i, y_i) ∈ D: if y_i w^T x_i ≤ 0, update w ← w + r y_i x_i; then a ← a + w. 3. Return a. Prediction: sgn(a^T x) = sgn(Σ_{i=1}^{k} c_i w_i^T x).
79 This is the simplest version of the averaged perceptron. There are some easy programming tricks to make sure that a is updated only when there is an error.
80 Extremely popular: if you want to use the Perceptron algorithm, use averaging.
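A direct rendering of this simplest averaged version (our sketch, reusing the toy dataset from earlier; a accumulates w after every example, which the "programming tricks" mentioned on the slide avoid recomputing naively):

```python
import numpy as np

def averaged_perceptron(D, n, T=10, r=1.0):
    """Averaged Perceptron: return the running sum a of all intermediate w's."""
    w = np.zeros(n)
    a = np.zeros(n)
    for _ in range(T):
        for x, y in D:
            if y * (w @ x) <= 0:          # error: standard Perceptron update
                w = w + r * y * x
            a = a + w                     # accumulate after *every* example
    return a

# Prediction is sgn(a^T x); on a separable toy set it matches the labels
D = [(np.array([2.0, 1.0, 1.0]), 1),
     (np.array([1.0, 3.0, 1.0]), 1),
     (np.array([-1.0, -1.0, 1.0]), -1),
     (np.array([-2.0, 0.5, 1.0]), -1)]
a = averaged_perceptron(D, 3)
assert all(np.sign(a @ x) == y for x, y in D)
```

Scaling a by the number of steps does not change the sign of a^T x, so summing instead of averaging gives the same predictions.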
81 Question: what is the difference? Voted prediction: sgn(Σ_{i=1}^{k} c_i sgn(w_i^T x)). Averaged prediction: sgn(Σ_{i=1}^{k} c_i w_i^T x). Example: suppose w_1^T x = s_1, w_2^T x = s_2, w_3^T x = s_3 and c_1 = c_2 = c_3 = 1. Averaged predicts positive iff s_1 + s_2 + s_3 ≥ 0; Voted predicts positive iff any two of the scores are positive.
82 Summary: Perceptron Online learning algorithm, very widely used, easy to implement Additive updates to weights Geometric interpretation Mistake bound Practical variants abound You should be able to implement the Perceptron algorithm 82
Lecture 6. Notes on Linear Algebra. Perceptron COMP90051 Statistical Machine Learning Semester 2, 2017 Lecturer: Andrey Kan Copyright: University of Melbourne This lecture Notes on linear algebra Vectors
More informationThe Perceptron Algorithm
The Perceptron Algorithm Greg Grudic Greg Grudic Machine Learning Questions? Greg Grudic Machine Learning 2 Binary Classification A binary classifier is a mapping from a set of d inputs to a single output
More informationCOMP 875 Announcements
Announcements Tentative presentation order is out Announcements Tentative presentation order is out Remember: Monday before the week of the presentation you must send me the final paper list (for posting
More informationFoundations of Machine Learning On-Line Learning. Mehryar Mohri Courant Institute and Google Research
Foundations of Machine Learning On-Line Learning Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Motivation PAC learning: distribution fixed over time (training and test). IID assumption.
More informationKernel Methods and Support Vector Machines
Kernel Methods and Support Vector Machines Oliver Schulte - CMPT 726 Bishop PRML Ch. 6 Support Vector Machines Defining Characteristics Like logistic regression, good for continuous input features, discrete
More informationInexact Search is Good Enough
Inexact Search is Good Enough Advanced Machine Learning for NLP Jordan Boyd-Graber MATHEMATICAL TREATMENT Advanced Machine Learning for NLP Boyd-Graber Inexact Search is Good Enough 1 of 1 Preliminaries:
More informationNeural Networks biological neuron artificial neuron 1
Neural Networks biological neuron artificial neuron 1 A two-layer neural network Output layer (activation represents classification) Weighted connections Hidden layer ( internal representation ) Input
More informationLinear & nonlinear classifiers
Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table
More informationKernels. Machine Learning CSE446 Carlos Guestrin University of Washington. October 28, Carlos Guestrin
Kernels Machine Learning CSE446 Carlos Guestrin University of Washington October 28, 2013 Carlos Guestrin 2005-2013 1 Linear Separability: More formally, Using Margin Data linearly separable, if there
More informationLecture 3: Multiclass Classification
Lecture 3: Multiclass Classification Kai-Wei Chang CS @ University of Virginia kw@kwchang.net Some slides are adapted from Vivek Skirmar and Dan Roth CS6501 Lecture 3 1 Announcement v Please enroll in
More informationMachine Learning Lecture 5
Machine Learning Lecture 5 Linear Discriminant Functions 26.10.2017 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Course Outline Fundamentals Bayes Decision Theory
More informationOnline Learning. Jordan Boyd-Graber. University of Colorado Boulder LECTURE 21. Slides adapted from Mohri
Online Learning Jordan Boyd-Graber University of Colorado Boulder LECTURE 21 Slides adapted from Mohri Jordan Boyd-Graber Boulder Online Learning 1 of 31 Motivation PAC learning: distribution fixed over
More informationMultilayer Neural Networks. (sometimes called Multilayer Perceptrons or MLPs)
Multilayer Neural Networks (sometimes called Multilayer Perceptrons or MLPs) Linear separability Hyperplane In 2D: w x + w 2 x 2 + w 0 = 0 Feature x 2 = w w 2 x w 0 w 2 Feature 2 A perceptron can separate
More informationLinear Classifiers. Michael Collins. January 18, 2012
Linear Classifiers Michael Collins January 18, 2012 Today s Lecture Binary classification problems Linear classifiers The perceptron algorithm Classification Problems: An Example Goal: build a system that
More informationMachine Learning for NLP
Machine Learning for NLP Uppsala University Department of Linguistics and Philology Slides borrowed from Ryan McDonald, Google Research Machine Learning for NLP 1(50) Introduction Linear Classifiers Classifiers
More informationSections 18.6 and 18.7 Analysis of Artificial Neural Networks
Sections 18.6 and 18.7 Analysis of Artificial Neural Networks CS4811 - Artificial Intelligence Nilufer Onder Department of Computer Science Michigan Technological University Outline Univariate regression
More informationThe Perceptron Algorithm, Margins
The Perceptron Algorithm, Margins MariaFlorina Balcan 08/29/2018 The Perceptron Algorithm Simple learning algorithm for supervised classification analyzed via geometric margins in the 50 s [Rosenblatt
More informationLinear and Logistic Regression. Dr. Xiaowei Huang
Linear and Logistic Regression Dr. Xiaowei Huang https://cgi.csc.liv.ac.uk/~xiaowei/ Up to now, Two Classical Machine Learning Algorithms Decision tree learning K-nearest neighbor Model Evaluation Metrics
More informationPredicting Sequences: Structured Perceptron. CS 6355: Structured Prediction
Predicting Sequences: Structured Perceptron CS 6355: Structured Prediction 1 Conditional Random Fields summary An undirected graphical model Decompose the score over the structure into a collection of
More informationMulticlass Classification-1
CS 446 Machine Learning Fall 2016 Oct 27, 2016 Multiclass Classification Professor: Dan Roth Scribe: C. Cheng Overview Binary to multiclass Multiclass SVM Constraint classification 1 Introduction Multiclass
More informationPerceptron. by Prof. Seungchul Lee Industrial AI Lab POSTECH. Table of Contents
Perceptron by Prof. Seungchul Lee Industrial AI Lab http://isystems.unist.ac.kr/ POSTECH Table of Contents I.. Supervised Learning II.. Classification III. 3. Perceptron I. 3.. Linear Classifier II. 3..
More informationIn the Name of God. Lecture 11: Single Layer Perceptrons
1 In the Name of God Lecture 11: Single Layer Perceptrons Perceptron: architecture We consider the architecture: feed-forward NN with one layer It is sufficient to study single layer perceptrons with just
More information18.9 SUPPORT VECTOR MACHINES
744 Chapter 8. Learning from Examples is the fact that each regression problem will be easier to solve, because it involves only the examples with nonzero weight the examples whose kernels overlap the
More informationInput layer. Weight matrix [ ] Output layer
MASSACHUSETTS INSTITUTE OF TECHNOLOGY Department of Electrical Engineering and Computer Science 6.034 Artificial Intelligence, Fall 2003 Recitation 10, November 4 th & 5 th 2003 Learning by perceptrons
More informationLECTURE NOTE #8 PROF. ALAN YUILLE. Can we find a linear classifier that separates the position and negative examples?
LECTURE NOTE #8 PROF. ALAN YUILLE 1. Linear Classifiers and Perceptrons A dataset contains N samples: { (x µ, y µ ) : µ = 1 to N }, y µ {±1} Can we find a linear classifier that separates the position
More informationMachine Learning Practice Page 2 of 2 10/28/13
Machine Learning 10-701 Practice Page 2 of 2 10/28/13 1. True or False Please give an explanation for your answer, this is worth 1 pt/question. (a) (2 points) No classifier can do better than a naive Bayes
More informationBrief Introduction to Machine Learning
Brief Introduction to Machine Learning Yuh-Jye Lee Lab of Data Science and Machine Intelligence Dept. of Applied Math. at NCTU August 29, 2016 1 / 49 1 Introduction 2 Binary Classification 3 Support Vector
More informationCS798: Selected topics in Machine Learning
CS798: Selected topics in Machine Learning Support Vector Machine Jakramate Bootkrajang Department of Computer Science Chiang Mai University Jakramate Bootkrajang CS798: Selected topics in Machine Learning
More informationMachine Learning: The Perceptron. Lecture 06
Machine Learning: he Perceptron Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu 1 McCulloch-Pitts Neuron Function 0 1 w 0 activation / output function 1 w 1 w w
More informationSupport Vector Machines
Support Vector Machines Reading: Ben-Hur & Weston, A User s Guide to Support Vector Machines (linked from class web page) Notation Assume a binary classification problem. Instances are represented by vector
More informationMax Margin-Classifier
Max Margin-Classifier Oliver Schulte - CMPT 726 Bishop PRML Ch. 7 Outline Maximum Margin Criterion Math Maximizing the Margin Non-Separable Data Kernels and Non-linear Mappings Where does the maximization
More informationPerceptron Mistake Bounds
Perceptron Mistake Bounds Mehryar Mohri, and Afshin Rostamizadeh Google Research Courant Institute of Mathematical Sciences Abstract. We present a brief survey of existing mistake bounds and introduce
More informationSingle layer NN. Neuron Model
Single layer NN We consider the simple architecture consisting of just one neuron. Generalization to a single layer with more neurons as illustrated below is easy because: M M The output units are independent
More informationSupport Vector Machines
Support Vector Machines INFO-4604, Applied Machine Learning University of Colorado Boulder September 28, 2017 Prof. Michael Paul Today Two important concepts: Margins Kernels Large Margin Classification
More informationPattern Recognition 2018 Support Vector Machines
Pattern Recognition 2018 Support Vector Machines Ad Feelders Universiteit Utrecht Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 1 / 48 Support Vector Machines Ad Feelders ( Universiteit Utrecht
More informationCOMP9444 Neural Networks and Deep Learning 2. Perceptrons. COMP9444 c Alan Blair, 2017
COMP9444 Neural Networks and Deep Learning 2. Perceptrons COMP9444 17s2 Perceptrons 1 Outline Neurons Biological and Artificial Perceptron Learning Linear Separability Multi-Layer Networks COMP9444 17s2
More information9 Classification. 9.1 Linear Classifiers
9 Classification This topic returns to prediction. Unlike linear regression where we were predicting a numeric value, in this case we are predicting a class: winner or loser, yes or no, rich or poor, positive
More informationMachine Learning, Fall 2011: Homework 5
0-60 Machine Learning, Fall 0: Homework 5 Machine Learning Department Carnegie Mellon University Due:??? Instructions There are 3 questions on this assignment. Please submit your completed homework to
More informationClassification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012
Classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Topics Discriminant functions Logistic regression Perceptron Generative models Generative vs. discriminative
More informationSupport Vector Machine (SVM) & Kernel CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012
Support Vector Machine (SVM) & Kernel CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Linear classifier Which classifier? x 2 x 1 2 Linear classifier Margin concept x 2
More informationMultilayer Neural Networks. (sometimes called Multilayer Perceptrons or MLPs)
Multilayer Neural Networks (sometimes called Multilayer Perceptrons or MLPs) Linear separability Hyperplane In 2D: w 1 x 1 + w 2 x 2 + w 0 = 0 Feature 1 x 2 = w 1 w 2 x 1 w 0 w 2 Feature 2 A perceptron
More informationSupport Vector Machines
Support Vector Machines Hypothesis Space variable size deterministic continuous parameters Learning Algorithm linear and quadratic programming eager batch SVMs combine three important ideas Apply optimization
More informationLinear Classifiers as Pattern Detectors
Intelligent Systems: Reasoning and Recognition James L. Crowley ENSIMAG 2 / MoSIG M1 Second Semester 2014/2015 Lesson 16 8 April 2015 Contents Linear Classifiers as Pattern Detectors Notation...2 Linear
More informationSections 18.6 and 18.7 Artificial Neural Networks
Sections 18.6 and 18.7 Artificial Neural Networks CS4811 - Artificial Intelligence Nilufer Onder Department of Computer Science Michigan Technological University Outline The brain vs artifical neural networks
More informationCOMS 4721: Machine Learning for Data Science Lecture 10, 2/21/2017
COMS 4721: Machine Learning for Data Science Lecture 10, 2/21/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University FEATURE EXPANSIONS FEATURE EXPANSIONS
More information1 Machine Learning Concepts (16 points)
CSCI 567 Fall 2018 Midterm Exam DO NOT OPEN EXAM UNTIL INSTRUCTED TO DO SO PLEASE TURN OFF ALL CELL PHONES Problem 1 2 3 4 5 6 Total Max 16 10 16 42 24 12 120 Points Please read the following instructions
More informationCS145: INTRODUCTION TO DATA MINING
CS145: INTRODUCTION TO DATA MINING 5: Vector Data: Support Vector Machine Instructor: Yizhou Sun yzsun@cs.ucla.edu October 18, 2017 Homework 1 Announcements Due end of the day of this Thursday (11:59pm)
More informationA Course in Machine Learning
A Course in Machine Learning Hal Daumé III 3 THE PERCEPTRON Learning Objectives: Describe the biological motivation behind the perceptron. Classify learning algorithms based on whether they are error-driven
More informationLearning: Binary Perceptron. Examples: Perceptron. Separable Case. In the space of feature vectors
Linear Classifiers CS 88 Artificial Intelligence Perceptrons and Logistic Regression Pieter Abbeel & Dan Klein University of California, Berkeley Feature Vectors Some (Simplified) Biology Very loose inspiration
More informationPAC-learning, VC Dimension and Margin-based Bounds
More details: General: http://www.learning-with-kernels.org/ Example of more complex bounds: http://www.research.ibm.com/people/t/tzhang/papers/jmlr02_cover.ps.gz PAC-learning, VC Dimension and Margin-based
More informationEfficient and Principled Online Classification Algorithms for Lifelon
Efficient and Principled Online Classification Algorithms for Lifelong Learning Toyota Technological Institute at Chicago Chicago, IL USA Talk @ Lifelong Learning for Mobile Robotics Applications Workshop,
More information