The Perceptron Algorithm
1 The Perceptron Algorithm. Machine Learning, Spring 2018. The slides are mainly from Vivek Srikumar.
2 Outline The Perceptron Algorithm Perceptron Mistake Bound Variants of Perceptron 2
3 Where are we? The Perceptron Algorithm Perceptron Mistake Bound Variants of Perceptron 3
4 Recall: Linear Classifiers. Input is an n-dimensional vector x; output is a label y ∈ {-1, 1}. Linear Threshold Units (LTUs) classify an example x using the rule: Output = sgn(w^T x + b) = sgn(b + Σ_i w_i x_i). If w^T x + b ≥ 0, predict y = +1; if w^T x + b < 0, predict y = -1. b is called the bias term.
5 Recall: Linear Classifiers (continued). [Figure: a linear threshold unit combining inputs x_i through weights w_i and a bias b, then applying sgn.]
6 The geometry of a linear classifier. The line b + w_1 x_1 + w_2 x_2 = 0 separates the plane; the classifier outputs sgn(b + w_1 x_1 + w_2 x_2). We only care about the sign, not the magnitude. [Figure: the weight vector [w_1 w_2] drawn normal to the separating line in the (x_1, x_2) plane.] In n dimensions, a linear classifier represents a hyperplane that separates the space into two half-spaces.
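The classification rule above is a one-liner in code. The following is a minimal sketch (the function name `predict` and the toy weights are ours, not from the slides; sgn(0) is treated as +1, matching the ≥ 0 rule):

```python
import numpy as np

def predict(w, b, x):
    """Linear threshold unit: sgn(w^T x + b), with sgn(0) treated as +1."""
    return 1 if w @ x + b >= 0 else -1

# Toy 2-D example: the separating line is x1 + x2 - 1 = 0
w = np.array([1.0, 1.0])
b = -1.0
print(predict(w, b, np.array([1.0, 1.0])))   # -> 1  (above the line)
print(predict(w, b, np.array([0.0, 0.0])))   # -> -1 (below the line)
```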
7 The Perceptron 7
8 The hype. The New Yorker, December 6, 1958, p. 44; The New York Times, July 1958. [Photo: the IBM 704 computer.]
9 The Perceptron algorithm (Rosenblatt, 1958). The goal is to find a separating hyperplane. For separable data, it is guaranteed to find one. It is an online algorithm: it processes one example at a time. Several variants exist (discussed briefly towards the end).
10 The Perceptron algorithm. Input: a sequence of training examples (x_1, y_1), (x_2, y_2), ..., where all x_i ∈ R^n and y_i ∈ {-1, 1}. Initialize w_0 = 0 ∈ R^n. For each training example (x_i, y_i): predict y' = sgn(w_t^T x_i); if y_i ≠ y', update w_{t+1} ← w_t + r (y_i x_i). Return the final weight vector.
11 Remember: prediction = sgn(w^T x). There is typically also a bias term (w^T x + b), but the bias may be treated as a constant feature and folded into w.
12 Footnote: for some algorithms it is mathematically easier to represent False as -1, and at other times as 0. For the Perceptron algorithm, treat -1 as false and +1 as true.
13 Mistake on a positive example: w_{t+1} ← w_t + r x_i. Mistake on a negative example: w_{t+1} ← w_t - r x_i.
14 r is the learning rate, a small positive number less than 1.
15 Update only on error: a mistake-driven algorithm.
16 This is the simplest version. We will see more robust versions at the end.
17 A mistake can be written compactly as y_i w_t^T x_i ≤ 0.
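The pseudocode on these slides can be sketched in a few lines of Python. This is our own minimal rendering of one online pass (sgn(0) is treated as +1 here; the function name and the toy data are assumptions, not from the slides):

```python
import numpy as np

def perceptron(examples, n, r=1.0):
    """One online pass of the Perceptron: on a mistake, w <- w + r * y * x."""
    w = np.zeros(n)
    for x, y in examples:
        y_hat = 1 if w @ x >= 0 else -1   # predict y' = sgn(w^T x)
        if y_hat != y:                    # mistake: y_i != y'
            w = w + r * y * x
    return w

# Toy sequence; the constant third feature plays the role of the bias
data = [(np.array([1.0, 1.0, 1.0]), 1),
        (np.array([0.0, 0.0, 1.0]), -1),
        (np.array([1.0, 0.0, 1.0]), -1),
        (np.array([0.0, 1.0, 1.0]), -1)]
w = perceptron(data, 3)
```

One pass need not separate the data; the batch version with epochs appears later under "variants".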
18 Intuition behind the update. Mistake on positive: w_{t+1} ← w_t + r x_i. Mistake on negative: w_{t+1} ← w_t - r x_i. Suppose we have made a mistake on a positive example: y = +1 but w_t^T x < 0. Call the new weight vector w_{t+1} = w_t + x (taking r = 1). The new dot product is w_{t+1}^T x = (w_t + x)^T x = w_t^T x + x^T x > w_t^T x. So for a positive example, the Perceptron update increases the score assigned to that input. Similar reasoning applies to negative examples.
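This intuition is easy to verify numerically. A toy check with r = 1 and made-up vectors (our own illustration):

```python
import numpy as np

# A positive example currently misclassified: w^T x < 0
w = np.array([-1.0, 0.5])
x = np.array([2.0, 1.0])
assert w @ x < 0                      # mistake on a positive example

w_new = w + x                         # Perceptron update with y = +1, r = 1
# The score rises by exactly x^T x = ||x||^2 > 0
assert np.isclose(w_new @ x, w @ x + x @ x)
assert w_new @ x > w @ x
```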
19 Geometry of the perceptron update, for a mistake on a positive example. Mistake on positive: w_{t+1} ← w_t + r x_i; mistake on negative: w_{t+1} ← w_t - r x_i. [Figure build across slides 19-25: predict with w_old; the example (x, +1) is misclassified; update w ← w + x, rotating the weight vector towards x; the "After" panel shows w_new with (x, +1) on the positive side.]
26 Geometry of the perceptron update, for a mistake on a negative example. [Figure build across slides 26-31: predict with w_old; the example (x, -1) is misclassified; update w ← w - x, rotating the weight vector away from x; the "After" panel shows w_new with (x, -1) on the negative side.]
32 Perceptron Learnability. Obviously the Perceptron cannot learn what it cannot represent: only linearly separable functions. Minsky and Papert (1969) wrote an influential book demonstrating the Perceptron's representational limitations: parity functions (e.g. XOR) can't be learned. But we already know that XOR is not linearly separable; the feature transformation trick can help.
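The feature transformation trick can be illustrated concretely for XOR. This is our own toy construction (the feature map and the weight vector are chosen by hand, not taken from the slides): adding the product feature x_1 x_2 makes XOR linearly separable.

```python
import numpy as np

def phi(x1, x2):
    """Map (x1, x2) to (x1, x2, x1*x2, 1); in this space XOR is separable."""
    return np.array([x1, x2, x1 * x2, 1.0])

# XOR truth table with labels in {-1, +1}
data = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), -1)]

# One separating weight vector in the transformed space, found by inspection
w = np.array([1.0, 1.0, -2.0, -0.5])
assert all((1 if w @ phi(*x) >= 0 else -1) == y for x, y in data)
```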
33 What you need to know The Perceptron algorithm The geometry of the update What can it represent 33
34 Where are we? The Perceptron Algorithm Perceptron Mistake Bound Variants of Perceptron 34
35 Convergence. Convergence theorem: if there exists a set of weights consistent with the data (i.e. the data is linearly separable), the perceptron algorithm will converge.
36 Cycling theorem: if the training data is not linearly separable, then the learning algorithm will eventually repeat the same set of weights and enter an infinite loop.
37 Margin The margin of a hyperplane for a dataset is the distance between the hyperplane and the data point nearest to it Margin with respect to this hyperplane 37
38 Margin. The margin of a hyperplane for a dataset is the distance between the hyperplane and the data point nearest to it. The margin of a dataset, γ, is the maximum margin attainable for that dataset over all weight vectors. [Figure: margin of the data.]
39 Mistake Bound Theorem [Novikoff 1962, Block 1962]. Let (x_1, y_1), (x_2, y_2), ..., (x_m, y_m) be a sequence of training examples such that for all i, the feature vector x_i ∈ R^n with ||x_i|| ≤ R, and the label y_i ∈ {-1, +1}. Suppose there exists a unit vector u ∈ R^n (i.e. ||u|| = 1) such that for some γ > 0 we have y_i (u^T x_i) ≥ γ for all i. Then the perceptron algorithm will make at most (R/γ)^2 mistakes on the training sequence.
42 We can always find such an R: just look for the farthest data point from the origin.
43 The data and u have a margin γ; importantly, the data is separable. γ is the complexity parameter that defines the separability of the data.
44 If u hadn't been a unit vector, we could scale γ in the mistake bound; this changes the final mistake bound to (||u|| R / γ)^2.
45 In words: for a binary classification dataset with n-dimensional inputs, if the data is separable, then the Perceptron algorithm will find a separating hyperplane after making a finite number of mistakes.
46 Proof (preliminaries). The algorithm: receive an input (x_i, y_i); if sgn(w_t^T x_i) ≠ y_i, update w_{t+1} ← w_t + y_i x_i. The setting: the initial weight vector w_0 is all zeros; the learning rate is 1 (it effectively scales the inputs but does not change the behavior); all training examples are contained in a ball of size R, i.e. ||x_i|| ≤ R; the training data is separable by margin γ using a unit vector u, i.e. y_i (u^T x_i) ≥ γ.
47 Proof (1/3). Claim: after t mistakes, u^T w_t ≥ tγ. Each mistake adds y_i x_i to the weight vector, and u^T (y_i x_i) = y_i (u^T x_i) ≥ γ because the data is separable by margin γ. Since w_0 = 0 (so u^T w_0 = 0), straightforward induction gives u^T w_t ≥ tγ.
50 Proof (2/3). Claim: after t mistakes, ||w_t||^2 ≤ tR^2. The weight is updated only when there is a mistake, that is, when y_i w_t^T x_i < 0. Then ||w_{t+1}||^2 = ||w_t + y_i x_i||^2 = ||w_t||^2 + 2 y_i w_t^T x_i + ||x_i||^2 ≤ ||w_t||^2 + R^2, using y_i w_t^T x_i < 0 and ||x_i|| ≤ R (by definition of R). Since w_0 = 0, straightforward induction gives ||w_t||^2 ≤ tR^2.
53 Proof (3/3). What we know: (1) after t mistakes, u^T w_t ≥ tγ; (2) after t mistakes, ||w_t||^2 ≤ tR^2. Now, u^T w_t = ||u|| ||w_t|| cos(angle between them); but ||u|| = 1 and cosine is at most 1, so u^T w_t ≤ ||w_t|| (the Cauchy-Schwarz inequality). Combining (1) and (2): tγ ≤ u^T w_t ≤ ||w_t|| ≤ √t R. Hence √t ≤ R/γ, and the number of mistakes satisfies t ≤ (R/γ)^2. This bounds the total number of mistakes!
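The chain of inequalities on these proof slides can be written out as a single display:

```latex
t\gamma \;\le\; u^{\top} w_t \;\le\; \lVert w_t \rVert \;\le\; \sqrt{t}\,R
\quad\Longrightarrow\quad
\sqrt{t} \;\le\; \frac{R}{\gamma}
\quad\Longrightarrow\quad
t \;\le\; \left(\frac{R}{\gamma}\right)^{2}
```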
60 Mistake Bound Theorem [Novikoff 1962, Block 1962], illustrated. [Figure build across slides 60-64: the data points lie inside a ball of radius R; a separating direction with margin γ; the number of mistakes is at most (R/γ)^2.]
65 The Perceptron mistake bound: the number of mistakes is at most (R/γ)^2. Exercises: How many mistakes will the Perceptron algorithm make for disjunctions with n attributes? What are R and γ? How many mistakes will it make for k-disjunctions with n attributes?
66 What you need to know What is the perceptron mistake bound? How to prove it 66
67 Where are we? The Perceptron Algorithm Perceptron Mistake Bound Variants of Perceptron 67
68 Practical use of the Perceptron algorithm 1. Using the Perceptron algorithm with a finite dataset 2. Margin Perceptron 3. Voting and Averaging 68
69 1. The standard algorithm. Given a training set D = {(x_i, y_i)}, x_i ∈ R^n, y_i ∈ {-1, 1}: 1. Initialize w = 0 ∈ R^n. 2. For epoch = 1 ... T: shuffle the data; for each training example (x_i, y_i) ∈ D: if y_i w^T x_i ≤ 0, update w ← w + r y_i x_i. 3. Return w. Prediction: sgn(w^T x).
70 T is a hyper-parameter of the algorithm. The condition y_i w^T x_i ≤ 0 is another way of writing that there is an error.
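The batch version above can be sketched directly in Python. A minimal rendering under our own naming (`perceptron_train`, the fixed `seed`, and the toy dataset are assumptions for illustration):

```python
import random
import numpy as np

def perceptron_train(D, n, T=20, r=1.0, seed=0):
    """Standard Perceptron: T epochs, shuffling before each pass,
    updating w <- w + r * y * x whenever y * w^T x <= 0 (an error)."""
    rng = random.Random(seed)
    w = np.zeros(n)
    D = list(D)
    for _ in range(T):
        rng.shuffle(D)
        for x, y in D:
            if y * (w @ x) <= 0:      # error (also fires on the boundary)
                w = w + r * y * x
    return w

# Linearly separable toy data; a constant 1 feature plays the role of the bias
D = [(np.array([2.0, 1.0, 1.0]), 1),
     (np.array([1.0, 3.0, 1.0]), 1),
     (np.array([-1.0, -1.0, 1.0]), -1),
     (np.array([-2.0, 0.5, 1.0]), -1)]
w = perceptron_train(D, 3)
assert all(np.sign(w @ x) == y for x, y in D)
```

Since this toy set is separable, the mistake bound guarantees the final pass is error-free for large enough T.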
71 2. Margin Perceptron. Which hyperplane is better, h_1 or h_2? [Figure across slides 71-73: two separating hyperplanes, one close to the data and one farther away.] The farther the hyperplane is from the data points, the smaller the chance of making a wrong prediction.
74 2. Margin Perceptron. The Perceptron makes updates only when the prediction is incorrect: y_i w^T x_i < 0. What if the prediction is close to being incorrect? Pick a positive η and update when y_i (w^T x_i) / ||w|| < η. This can generalize better, but we need to choose η. Why is this a good idea? It intentionally enforces a large margin.
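The modified update condition can be sketched as follows. This is our illustration, not code from the slides: η denotes the chosen margin parameter, and we guard against dividing by the zero norm of the initial weight vector (an assumption the slide does not spell out).

```python
import numpy as np

def margin_mistake(w, x, y, eta):
    """Margin Perceptron condition: update when y * w^T x / ||w|| < eta."""
    norm = np.linalg.norm(w)
    if norm == 0.0:
        return True                       # initial w = 0: always update
    return (y * (w @ x)) / norm < eta

w = np.array([3.0, 4.0])                  # ||w|| = 5
x = np.array([1.0, 0.5])
# Normalized score: y * w^T x / ||w|| = (3 + 2) / 5 = 1.0 (correct prediction)
assert margin_mistake(w, x, 1, eta=1.5)       # within the margin -> update
assert not margin_mistake(w, x, 1, eta=0.5)   # comfortably correct -> no update
```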
76 3. Voting and Averaging. What if the data is not linearly separable? Finding a hyperplane with the minimum number of mistakes is NP-hard.
77 Voted Perceptron. Given a training set D = {(x_i, y_i)}, x_i ∈ R^n, y_i ∈ {-1, 1}: 1. Initialize w_1 = 0 ∈ R^n, m = 1, C_1 = 0. 2. For epoch = 1 ... T: for each training example (x_i, y_i) ∈ D: if y_i w_m^T x_i ≤ 0, update w_{m+1} ← w_m + r y_i x_i, set m = m + 1 and C_m = 1; else C_m = C_m + 1. 3. Return (w_1, C_1), (w_2, C_2), ..., (w_k, C_k). Prediction: sgn(Σ_{i=1}^{k} C_i sgn(w_i^T x)).
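The bookkeeping above can be sketched in Python. This is our reconstruction of the slide's pseudocode (function names and the toy dataset are ours): each intermediate weight vector w_m votes with weight C_m, the number of examples it classified correctly before the next mistake.

```python
import numpy as np

def voted_perceptron(D, n, T=10, r=1.0):
    """Voted Perceptron: keep every intermediate weight vector w_m together
    with a count C_m of the examples it classified correctly."""
    W, C = [np.zeros(n)], [0]
    for _ in range(T):
        for x, y in D:
            if y * (W[-1] @ x) <= 0:      # mistake: start a new weight vector
                W.append(W[-1] + r * y * x)
                C.append(1)
            else:                          # current vector survives one more example
                C[-1] += 1
    return W, C

def voted_predict(W, C, x):
    """sgn( sum_m C_m * sgn(w_m^T x) ), treating sgn(0) as +1."""
    s = sum(c * (1 if w @ x >= 0 else -1) for w, c in zip(W, C))
    return 1 if s >= 0 else -1

# Separable toy set; the constant third feature plays the role of the bias
D = [(np.array([2.0, 1.0, 1.0]), 1),
     (np.array([1.0, 3.0, 1.0]), 1),
     (np.array([-1.0, -1.0, 1.0]), -1),
     (np.array([-2.0, 0.5, 1.0]), -1)]
W, C = voted_perceptron(D, 3)
```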
78 Averaged Perceptron. Given a training set D = {(x_i, y_i)}, x_i ∈ R^n, y_i ∈ {-1, 1}: 1. Initialize w = 0 ∈ R^n and a = 0 ∈ R^n. 2. For epoch = 1 ... T: for each training example (x_i, y_i) ∈ D: if y_i w^T x_i ≤ 0, update w ← w + r y_i x_i; then a ← a + w. 3. Return a. Prediction: sgn(a^T x) = sgn(Σ_{i=1}^{k} c_i w_i^T x).
79 This is the simplest version of the averaged perceptron. There are some easy programming tricks to make sure that a is updated only when there is an error.
80 Extremely popular: if you want to use the Perceptron algorithm, use averaging.
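A direct rendering of this simplest averaged version (our sketch, reusing the toy dataset from earlier; a accumulates w after every example, which the "programming tricks" mentioned on the slide avoid recomputing naively):

```python
import numpy as np

def averaged_perceptron(D, n, T=10, r=1.0):
    """Averaged Perceptron: return the running sum a of all intermediate w's."""
    w = np.zeros(n)
    a = np.zeros(n)
    for _ in range(T):
        for x, y in D:
            if y * (w @ x) <= 0:          # error: standard Perceptron update
                w = w + r * y * x
            a = a + w                     # accumulate after *every* example
    return a

# Prediction is sgn(a^T x); on a separable toy set it matches the labels
D = [(np.array([2.0, 1.0, 1.0]), 1),
     (np.array([1.0, 3.0, 1.0]), 1),
     (np.array([-1.0, -1.0, 1.0]), -1),
     (np.array([-2.0, 0.5, 1.0]), -1)]
a = averaged_perceptron(D, 3)
assert all(np.sign(a @ x) == y for x, y in D)
```

Scaling a by the number of steps does not change the sign of a^T x, so summing instead of averaging gives the same predictions.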
81 Question: what is the difference? Voted prediction: sgn(Σ_{i=1}^{k} c_i sgn(w_i^T x)). Averaged prediction: sgn(Σ_{i=1}^{k} c_i w_i^T x). Example: suppose w_1^T x = s_1, w_2^T x = s_2, w_3^T x = s_3 and c_1 = c_2 = c_3 = 1. Averaged predicts positive iff s_1 + s_2 + s_3 ≥ 0; Voted predicts positive iff any two of the scores are positive.
82 Summary: Perceptron Online learning algorithm, very widely used, easy to implement Additive updates to weights Geometric interpretation Mistake bound Practical variants abound You should be able to implement the Perceptron algorithm 82
Lecture 6. Notes on Linear Algebra. Perceptron COMP90051 Statistical Machine Learning Semester 2, 2017 Lecturer: Andrey Kan Copyright: University of Melbourne This lecture Notes on linear algebra Vectors
More informationThe Perceptron Algorithm
The Perceptron Algorithm Greg Grudic Greg Grudic Machine Learning Questions? Greg Grudic Machine Learning 2 Binary Classification A binary classifier is a mapping from a set of d inputs to a single output
More informationCOMP 875 Announcements
Announcements Tentative presentation order is out Announcements Tentative presentation order is out Remember: Monday before the week of the presentation you must send me the final paper list (for posting
More informationFoundations of Machine Learning On-Line Learning. Mehryar Mohri Courant Institute and Google Research
Foundations of Machine Learning On-Line Learning Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Motivation PAC learning: distribution fixed over time (training and test). IID assumption.
More informationKernel Methods and Support Vector Machines
Kernel Methods and Support Vector Machines Oliver Schulte - CMPT 726 Bishop PRML Ch. 6 Support Vector Machines Defining Characteristics Like logistic regression, good for continuous input features, discrete
More informationInexact Search is Good Enough
Inexact Search is Good Enough Advanced Machine Learning for NLP Jordan Boyd-Graber MATHEMATICAL TREATMENT Advanced Machine Learning for NLP Boyd-Graber Inexact Search is Good Enough 1 of 1 Preliminaries:
More informationNeural Networks biological neuron artificial neuron 1
Neural Networks biological neuron artificial neuron 1 A two-layer neural network Output layer (activation represents classification) Weighted connections Hidden layer ( internal representation ) Input
More informationLinear & nonlinear classifiers
Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table
More informationKernels. Machine Learning CSE446 Carlos Guestrin University of Washington. October 28, Carlos Guestrin
Kernels Machine Learning CSE446 Carlos Guestrin University of Washington October 28, 2013 Carlos Guestrin 2005-2013 1 Linear Separability: More formally, Using Margin Data linearly separable, if there
More informationLecture 3: Multiclass Classification
Lecture 3: Multiclass Classification Kai-Wei Chang CS @ University of Virginia kw@kwchang.net Some slides are adapted from Vivek Skirmar and Dan Roth CS6501 Lecture 3 1 Announcement v Please enroll in
More informationMachine Learning Lecture 5
Machine Learning Lecture 5 Linear Discriminant Functions 26.10.2017 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Course Outline Fundamentals Bayes Decision Theory
More informationOnline Learning. Jordan Boyd-Graber. University of Colorado Boulder LECTURE 21. Slides adapted from Mohri
Online Learning Jordan Boyd-Graber University of Colorado Boulder LECTURE 21 Slides adapted from Mohri Jordan Boyd-Graber Boulder Online Learning 1 of 31 Motivation PAC learning: distribution fixed over
More informationMultilayer Neural Networks. (sometimes called Multilayer Perceptrons or MLPs)
Multilayer Neural Networks (sometimes called Multilayer Perceptrons or MLPs) Linear separability Hyperplane In 2D: w x + w 2 x 2 + w 0 = 0 Feature x 2 = w w 2 x w 0 w 2 Feature 2 A perceptron can separate
More informationLinear Classifiers. Michael Collins. January 18, 2012
Linear Classifiers Michael Collins January 18, 2012 Today s Lecture Binary classification problems Linear classifiers The perceptron algorithm Classification Problems: An Example Goal: build a system that
More informationMachine Learning for NLP
Machine Learning for NLP Uppsala University Department of Linguistics and Philology Slides borrowed from Ryan McDonald, Google Research Machine Learning for NLP 1(50) Introduction Linear Classifiers Classifiers
More informationSections 18.6 and 18.7 Analysis of Artificial Neural Networks
Sections 18.6 and 18.7 Analysis of Artificial Neural Networks CS4811 - Artificial Intelligence Nilufer Onder Department of Computer Science Michigan Technological University Outline Univariate regression
More informationThe Perceptron Algorithm, Margins
The Perceptron Algorithm, Margins MariaFlorina Balcan 08/29/2018 The Perceptron Algorithm Simple learning algorithm for supervised classification analyzed via geometric margins in the 50 s [Rosenblatt
More informationLinear and Logistic Regression. Dr. Xiaowei Huang
Linear and Logistic Regression Dr. Xiaowei Huang https://cgi.csc.liv.ac.uk/~xiaowei/ Up to now, Two Classical Machine Learning Algorithms Decision tree learning K-nearest neighbor Model Evaluation Metrics
More informationPredicting Sequences: Structured Perceptron. CS 6355: Structured Prediction
Predicting Sequences: Structured Perceptron CS 6355: Structured Prediction 1 Conditional Random Fields summary An undirected graphical model Decompose the score over the structure into a collection of
More informationMulticlass Classification-1
CS 446 Machine Learning Fall 2016 Oct 27, 2016 Multiclass Classification Professor: Dan Roth Scribe: C. Cheng Overview Binary to multiclass Multiclass SVM Constraint classification 1 Introduction Multiclass
More informationPerceptron. by Prof. Seungchul Lee Industrial AI Lab POSTECH. Table of Contents
Perceptron by Prof. Seungchul Lee Industrial AI Lab http://isystems.unist.ac.kr/ POSTECH Table of Contents I.. Supervised Learning II.. Classification III. 3. Perceptron I. 3.. Linear Classifier II. 3..
More informationIn the Name of God. Lecture 11: Single Layer Perceptrons
1 In the Name of God Lecture 11: Single Layer Perceptrons Perceptron: architecture We consider the architecture: feed-forward NN with one layer It is sufficient to study single layer perceptrons with just
More information18.9 SUPPORT VECTOR MACHINES
744 Chapter 8. Learning from Examples is the fact that each regression problem will be easier to solve, because it involves only the examples with nonzero weight the examples whose kernels overlap the
More informationInput layer. Weight matrix [ ] Output layer
MASSACHUSETTS INSTITUTE OF TECHNOLOGY Department of Electrical Engineering and Computer Science 6.034 Artificial Intelligence, Fall 2003 Recitation 10, November 4 th & 5 th 2003 Learning by perceptrons
More informationLECTURE NOTE #8 PROF. ALAN YUILLE. Can we find a linear classifier that separates the position and negative examples?
LECTURE NOTE #8 PROF. ALAN YUILLE 1. Linear Classifiers and Perceptrons A dataset contains N samples: { (x µ, y µ ) : µ = 1 to N }, y µ {±1} Can we find a linear classifier that separates the position
More informationMachine Learning Practice Page 2 of 2 10/28/13
Machine Learning 10-701 Practice Page 2 of 2 10/28/13 1. True or False Please give an explanation for your answer, this is worth 1 pt/question. (a) (2 points) No classifier can do better than a naive Bayes
More informationBrief Introduction to Machine Learning
Brief Introduction to Machine Learning Yuh-Jye Lee Lab of Data Science and Machine Intelligence Dept. of Applied Math. at NCTU August 29, 2016 1 / 49 1 Introduction 2 Binary Classification 3 Support Vector
More informationCS798: Selected topics in Machine Learning
CS798: Selected topics in Machine Learning Support Vector Machine Jakramate Bootkrajang Department of Computer Science Chiang Mai University Jakramate Bootkrajang CS798: Selected topics in Machine Learning
More informationMachine Learning: The Perceptron. Lecture 06
Machine Learning: he Perceptron Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu 1 McCulloch-Pitts Neuron Function 0 1 w 0 activation / output function 1 w 1 w w
More informationSupport Vector Machines
Support Vector Machines Reading: Ben-Hur & Weston, A User s Guide to Support Vector Machines (linked from class web page) Notation Assume a binary classification problem. Instances are represented by vector
More informationMax Margin-Classifier
Max Margin-Classifier Oliver Schulte - CMPT 726 Bishop PRML Ch. 7 Outline Maximum Margin Criterion Math Maximizing the Margin Non-Separable Data Kernels and Non-linear Mappings Where does the maximization
More informationPerceptron Mistake Bounds
Perceptron Mistake Bounds Mehryar Mohri, and Afshin Rostamizadeh Google Research Courant Institute of Mathematical Sciences Abstract. We present a brief survey of existing mistake bounds and introduce
More informationSingle layer NN. Neuron Model
Single layer NN We consider the simple architecture consisting of just one neuron. Generalization to a single layer with more neurons as illustrated below is easy because: M M The output units are independent
More informationSupport Vector Machines
Support Vector Machines INFO-4604, Applied Machine Learning University of Colorado Boulder September 28, 2017 Prof. Michael Paul Today Two important concepts: Margins Kernels Large Margin Classification
More informationPattern Recognition 2018 Support Vector Machines
Pattern Recognition 2018 Support Vector Machines Ad Feelders Universiteit Utrecht Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 1 / 48 Support Vector Machines Ad Feelders ( Universiteit Utrecht
More informationCOMP9444 Neural Networks and Deep Learning 2. Perceptrons. COMP9444 c Alan Blair, 2017
COMP9444 Neural Networks and Deep Learning 2. Perceptrons COMP9444 17s2 Perceptrons 1 Outline Neurons Biological and Artificial Perceptron Learning Linear Separability Multi-Layer Networks COMP9444 17s2
More information9 Classification. 9.1 Linear Classifiers
9 Classification This topic returns to prediction. Unlike linear regression where we were predicting a numeric value, in this case we are predicting a class: winner or loser, yes or no, rich or poor, positive
More informationMachine Learning, Fall 2011: Homework 5
0-60 Machine Learning, Fall 0: Homework 5 Machine Learning Department Carnegie Mellon University Due:??? Instructions There are 3 questions on this assignment. Please submit your completed homework to
More informationClassification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012
Classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Topics Discriminant functions Logistic regression Perceptron Generative models Generative vs. discriminative
More informationSupport Vector Machine (SVM) & Kernel CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012
Support Vector Machine (SVM) & Kernel CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Linear classifier Which classifier? x 2 x 1 2 Linear classifier Margin concept x 2
More informationMultilayer Neural Networks. (sometimes called Multilayer Perceptrons or MLPs)
Multilayer Neural Networks (sometimes called Multilayer Perceptrons or MLPs) Linear separability Hyperplane In 2D: w 1 x 1 + w 2 x 2 + w 0 = 0 Feature 1 x 2 = w 1 w 2 x 1 w 0 w 2 Feature 2 A perceptron
More informationSupport Vector Machines
Support Vector Machines Hypothesis Space variable size deterministic continuous parameters Learning Algorithm linear and quadratic programming eager batch SVMs combine three important ideas Apply optimization
More informationLinear Classifiers as Pattern Detectors
Intelligent Systems: Reasoning and Recognition James L. Crowley ENSIMAG 2 / MoSIG M1 Second Semester 2014/2015 Lesson 16 8 April 2015 Contents Linear Classifiers as Pattern Detectors Notation...2 Linear
More informationSections 18.6 and 18.7 Artificial Neural Networks
Sections 18.6 and 18.7 Artificial Neural Networks CS4811 - Artificial Intelligence Nilufer Onder Department of Computer Science Michigan Technological University Outline The brain vs artifical neural networks
More informationCOMS 4721: Machine Learning for Data Science Lecture 10, 2/21/2017
COMS 4721: Machine Learning for Data Science Lecture 10, 2/21/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University FEATURE EXPANSIONS FEATURE EXPANSIONS
More information1 Machine Learning Concepts (16 points)
CSCI 567 Fall 2018 Midterm Exam DO NOT OPEN EXAM UNTIL INSTRUCTED TO DO SO PLEASE TURN OFF ALL CELL PHONES Problem 1 2 3 4 5 6 Total Max 16 10 16 42 24 12 120 Points Please read the following instructions
More informationCS145: INTRODUCTION TO DATA MINING
CS145: INTRODUCTION TO DATA MINING 5: Vector Data: Support Vector Machine Instructor: Yizhou Sun yzsun@cs.ucla.edu October 18, 2017 Homework 1 Announcements Due end of the day of this Thursday (11:59pm)
More informationA Course in Machine Learning
A Course in Machine Learning Hal Daumé III 3 THE PERCEPTRON Learning Objectives: Describe the biological motivation behind the perceptron. Classify learning algorithms based on whether they are error-driven
More informationLearning: Binary Perceptron. Examples: Perceptron. Separable Case. In the space of feature vectors
Linear Classifiers CS 88 Artificial Intelligence Perceptrons and Logistic Regression Pieter Abbeel & Dan Klein University of California, Berkeley Feature Vectors Some (Simplified) Biology Very loose inspiration
More informationPAC-learning, VC Dimension and Margin-based Bounds
More details: General: http://www.learning-with-kernels.org/ Example of more complex bounds: http://www.research.ibm.com/people/t/tzhang/papers/jmlr02_cover.ps.gz PAC-learning, VC Dimension and Margin-based
More informationEfficient and Principled Online Classification Algorithms for Lifelon
Efficient and Principled Online Classification Algorithms for Lifelong Learning Toyota Technological Institute at Chicago Chicago, IL USA Talk @ Lifelong Learning for Mobile Robotics Applications Workshop,
More information