The Perceptron algorithm Tirgul 3 November 2016
Agnostic PAC Learnability
A hypothesis class H is agnostic PAC learnable if there exists a function m_H : (0,1)² → ℕ and a learning algorithm with the following property: for every ε, δ ∈ (0,1) and for every distribution D over X × Y, when running the learning algorithm on m ≥ m_H(ε, δ) i.i.d. examples generated by D, the algorithm returns a hypothesis h such that, with probability of at least 1 − δ (over the choice of the m training examples):
L_D(h) ≤ min_{h′ ∈ H} L_D(h′) + ε
Agnostic PAC Learnability: L_D(h) ≤ min_{h′ ∈ H} L_D(h′) + ε
Goal: what is h* = argmin_{h ∈ H} L_D(h)?
When Life Gives You Lemons, Make Lemonade
We do have our sample set S, and we hope it represents the distribution pretty well (the i.i.d. assumption). So why can't we just minimize the error over the training set? In other words: Empirical Risk Minimization.
Empirical Risk Minimization
Examples: Consistent, Halving
L_S(h) = |{i ∈ [m] : h(x_i) ≠ y_i}| / m
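The empirical risk L_S(h) above is straightforward to compute; a minimal numpy sketch (the 1-D data and the threshold hypothesis h are illustrative choices, not from the slides):

```python
import numpy as np

def empirical_risk(h, X, y):
    """L_S(h): the fraction of training examples that h misclassifies."""
    predictions = np.array([h(x) for x in X])
    return np.mean(predictions != y)

# Toy 1-D sample and a hypothetical threshold classifier at 0.
X = np.array([-2.0, -1.0, 0.5, 2.0])
y = np.array([-1, -1, 1, 1])
h = lambda x: 1 if x > 0 else -1
print(empirical_risk(h, X, y))  # 0.0: h is consistent with S
```

ERM simply picks the h ∈ H minimizing this quantity over the sample.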
Linear Predictors ERM Approach 6
Introduction
Linear predictors are: efficient; intuitive; and fit the data reasonably well in many natural learning problems.
Several hypothesis classes: linear regression, logistic regression, Perceptron.
Example
Linear Predictors
The different hypothesis classes of linear predictors are compositions of a function φ: ℝ → Y on H:
Binary classification: φ is the sign function sgn(x).
Regression: φ is the identity function (φ(x) = x).
Halfspaces
Designed for binary classification problems: X = ℝ^d, Y = {±1}
H_halfspaces = {x ↦ sign(⟨w, x⟩ + b) : w ∈ ℝ^d, b ∈ ℝ}
Geometric illustration (d = 2): each hypothesis corresponds to a hyperplane that is perpendicular to the vector w. Instances above the hyperplane are labeled positively; instances below the hyperplane are labeled negatively.
Adding a Bias
Add b (a bias) into w as an extra coordinate: w′ = (b, w_1, w_2, …, w_d) ∈ ℝ^{d+1}, and add a value of 1 to all x ∈ X: x′ = (1, x_1, x_2, …, x_d) ∈ ℝ^{d+1}.
Thus, each affine function in ℝ^d can be rewritten as a homogeneous linear function in ℝ^{d+1}.
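The bias trick can be checked numerically; a small numpy sketch (the weight and instance values are arbitrary illustrative choices):

```python
import numpy as np

def add_bias_coordinate(X):
    """Prepend a constant 1 to every instance: maps R^d to R^(d+1)."""
    ones = np.ones((X.shape[0], 1))
    return np.hstack([ones, X])

w, b = np.array([2.0, -1.0]), 0.5
X = np.array([[1.0, 3.0], [0.0, 2.0]])

affine = X @ w + b                              # <w, x> + b in R^d
w_prime = np.concatenate([[b], w])              # w' = (b, w_1, ..., w_d)
homogeneous = add_bias_coordinate(X) @ w_prime  # <w', x'> in R^(d+1)
print(np.allclose(affine, homogeneous))  # True: the two forms agree
```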
The Dot Product
Algebraic definition: w · x = Σ_{i=1}^n w_i x_i = w_1 x_1 + w_2 x_2 + … + w_n x_n
Notation: ⟨w, x⟩ = wᵀx
Example: a = (0, 3), b = (4, 0): a · b = 0·4 + 3·0 = 0
The Dot Product
Geometric definition: a · b = ‖a‖ ‖b‖ cos θ, where ‖x‖ is the magnitude of the vector x and θ is the angle between a and b.
If θ = 90°: a · b = 0
If θ = 0°: a · b = ‖a‖ ‖b‖
This implies that the dot product of a vector with itself is a · a = ‖a‖², which gives ‖a‖ = √(a · a), the formula for the Euclidean length of the vector.
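Both definitions can be checked on the slide's example; a minimal numpy sketch:

```python
import numpy as np

# The slide's example: a = (0, 3) and b = (4, 0) are perpendicular.
a = np.array([0.0, 3.0])
b = np.array([4.0, 0.0])

# Algebraic definition: sum of coordinate-wise products.
algebraic = np.dot(a, b)

# Geometric definition: |a| |b| cos(theta), with theta = 90 degrees here.
theta = np.pi / 2
geometric = np.linalg.norm(a) * np.linalg.norm(b) * np.cos(theta)

print(algebraic)                         # 0.0
print(np.isclose(algebraic, geometric))  # True
# |a| = sqrt(a . a): the Euclidean length of the vector.
print(np.isclose(np.linalg.norm(a), np.sqrt(np.dot(a, a))))  # True
```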
The Decision Boundary
The Perceptron tries to find a straight line that separates the positive examples from the negative ones: a line in 2D, a plane in 3D, a hyperplane in higher dimensions. This is called a decision boundary.
The Linearly Separable Case
The linearly separable case: a perfect decision boundary exists (the realizable case).
The separable case: it is possible to separate all the positive examples from all the negative ones with a hyperplane.
Finding an ERM Halfspace In the separable case: Linear Programming The Perceptron Algorithm (Rosenblatt, 1957) In the non-separable case: Learn a halfspace that minimizes a different loss function E.g. Logistic Regression 16
Perceptron
x_i - inputs; w_i - weights
The inputs x_i are multiplied by the weights w_i, and the neuron sums their values. If the sum is greater than the threshold θ, then the neuron fires (outputs 1); otherwise, it does not.
Finding θ
We now need to learn both w and θ.
Finding θ
θ is equivalent to the parameter b we mentioned previously. Reminder: we added a bias. Thus, we get an adjustable threshold without learning another parameter: fix x_0 = 1 and learn the bias weight w_0 = −θ along with the other weights.
Perceptron for Halfspaces
Perceptron for Halfspaces
Our goal is to have ∀i: y_i⟨w, x_i⟩ > 0. On a mistake, the update rule w^(t+1) = w^(t) + y_i x_i gives:
y_i⟨w^(t+1), x_i⟩ = y_i⟨w^(t) + y_i x_i, x_i⟩ = y_i⟨w^(t), x_i⟩ + ‖x_i‖²
The update rule makes the Perceptron more correct on the i-th example.
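The identity above can be verified numerically; a small sketch with an arbitrary (hypothetical) weight vector and mistake example:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=5)   # current weights w^(t), arbitrary
x = rng.normal(size=5)   # a hypothetical misclassified example
y = -1.0                 # its label

w_next = w + y * x       # the Perceptron update on a mistake

lhs = y * np.dot(w_next, x)
rhs = y * np.dot(w, x) + np.dot(x, x)   # y<w, x> + ||x||^2
print(np.isclose(lhs, rhs))  # True: the margin term grows by ||x||^2
```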
The Learning Rate η
The update rule: w^(t+1) = w^(t) + y_i x_i
We could add a parameter η: w^(t+1) = w^(t) + η y_i x_i, which controls how much the weights change.
In the separable case η has no effect. (Proof: HW.)
If η is large (close to 1): the weights change a lot whenever there is a wrong answer, giving an unstable network that never settles down.
If η is very small: the weights need to see inputs more often before they change significantly, so the network takes longer to learn.
Typically choose 0.1 ≤ η ≤ 0.4.
Example: Logic Function OR
Data of the OR logic function and a plot of the data points:
x_1 x_2 | y
 0   0  | −1
 0   1  | +1
 1   0  | +1
 1   1  | +1
A Feasibility Problem
Suppose the algorithm found a weight vector that classifies all of the examples correctly. There are many different weight values that will give correct outputs! We are interested in finding any set of weights that works: this is feasibility. A feasibility (or satisfiability) problem is the problem of finding any feasible solution, without regard to the objective value.
Example: Logic Function OR
The Perceptron network
Example: Logic Function OR
We need to find the three weights. Initially: w^(1) = (0, 0, 0).
First input: x_1 = (0, 0), y_1 = −1. Include the bias: (1, 0, 0).
Value of neuron: ⟨w^(1), x_1⟩ = 0·1 + 0·0 + 0·0 = 0
y_1⟨w^(1), x_1⟩ = 0 ≤ 0 → update: w^(2) = w^(1) + (−1)(1, 0, 0) = (−1, 0, 0)
Example: Logic Function OR
w^(2) = (−1, 0, 0)
Second input: x_2 = (0, 1), y_2 = 1. Include the bias: (1, 0, 1).
Value of neuron: ⟨w^(2), x_2⟩ = −1·1 + 0·0 + 0·1 = −1
y_2⟨w^(2), x_2⟩ = −1 ≤ 0 → update: w^(3) = w^(2) + (1)(1, 0, 1) = (0, 0, 1)
Example: Logic Function OR
w^(3) = (0, 0, 1)
Third input: x_3 = (1, 0), y_3 = 1. Include the bias: (1, 1, 0).
Value of neuron: ⟨w^(3), x_3⟩ = 0·1 + 0·1 + 1·0 = 0
y_3⟨w^(3), x_3⟩ = 0 ≤ 0 → update: w^(4) = w^(3) + (1)(1, 1, 0) = (1, 1, 1)
Example: Logic Function OR
w^(4) = (1, 1, 1)
Fourth input: x_4 = (1, 1), y_4 = 1. Include the bias: (1, 1, 1).
Value of neuron: ⟨w^(4), x_4⟩ = 1·1 + 1·1 + 1·1 = 3
y_4⟨w^(4), x_4⟩ = 3 > 0 → no update
Example: Logic Function OR
Not done yet! w^(4) = (1, 1, 1)
First input again: x_1 = (0, 0), y_1 = −1. Include the bias: (1, 0, 0).
Value of neuron: ⟨w^(4), x_1⟩ = 1·1 + 1·0 + 1·0 = 1
y_1⟨w^(4), x_1⟩ = −1 ≤ 0 → need to update again
Example: Logic Function OR
We've been through all the inputs once, but that doesn't mean we're finished! We need to go through the inputs again, until the weights settle down and stop changing. When the data is not separable, the weights may never stop changing.
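The whole walkthrough can be reproduced by looping over the data until a full pass makes no mistakes; a minimal sketch with η = 1 (the pass cap is a safety choice, not from the slides):

```python
import numpy as np

# OR data with the bias coordinate x_0 = 1 prepended; labels in {-1, +1}.
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
y = np.array([-1.0, 1.0, 1.0, 1.0])

w = np.zeros(3)           # w^(1) = (0, 0, 0), as in the walkthrough
for _ in range(100):      # cap on the number of passes, for safety
    mistakes = 0
    for x_i, y_i in zip(X, y):
        if y_i * np.dot(w, x_i) <= 0:  # mistake (or on the boundary)
            w += y_i * x_i             # the Perceptron update rule
            mistakes += 1
    if mistakes == 0:     # a full pass with no mistakes: converged
        break

print(w)  # a weight vector that separates the OR data
```

The first pass reproduces the four steps shown on the previous slides; the loop then continues until every example satisfies y_i⟨w, x_i⟩ > 0.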
When to stop?
The algorithm runs over the dataset many times. How do we decide when to stop learning (in general)?
Validation Set
Training set: to train the algorithm, i.e., to adjust the weights.
Validation set: to keep track of how well it is doing, i.e., to verify that any increase in accuracy over the training data yields an increase in accuracy over a dataset that the network wasn't trained on.
Test set: to produce the final results and test the final solution, in order to confirm the actual predictive power of the algorithm.
Validation Set
The proportions of the train/validation/test sets are typically 60:20:20 (after the dataset has been shuffled!).
Alternatively: K-fold cross-validation. The dataset is randomly partitioned into K subsets. One subset is used for validation, and the algorithm is trained on all the others. Then a different subset is left out, and a new model is trained. Repeat the process for all K subsets. Finally, the model that produced the lowest validation error is used.
Leave-one-out: the algorithm is validated on one piece of data and trained on all the rest, N times, where N is the length of the dataset.
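The K-fold partitioning step can be sketched as follows (the sizes and seed are illustrative, and the train/validate calls themselves are omitted):

```python
import numpy as np

def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and partition them into k roughly equal folds."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n)
    return np.array_split(indices, k)

folds = k_fold_indices(n=10, k=5)
for i, validation_fold in enumerate(folds):
    # Train on every fold except fold i; validate on fold i.
    train = np.concatenate([f for j, f in enumerate(folds) if j != i])
    print(i, len(validation_fold), len(train))
```

Setting k = n gives leave-one-out: each fold holds a single example.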
Overfitting
Rather than finding a general function (left), our network matches the inputs perfectly, including the noise in them (right). This reduces its generalization capabilities.
Back to: When to stop?
If we plot the error during training, it typically decreases fairly quickly during the first few training iterations; then the decrease slows down as the learning algorithm performs small changes to find the exact local minimum.
Note: this graph is general and does not necessarily describe the behavior of the error rate while training the Perceptron, because the Perceptron does not guarantee that there will be fewer mistakes on later iterations.
When to stop?
We don't want to stop training before the local minimum has been found, but training for too long leads to overfitting. This is where the validation set comes in useful.
When to stop? We train the network for some predetermined amount of time, and then use the validation set to estimate how well the network is generalizing. We then carry on training for a few more iterations, and repeat the whole process. 38
When to stop? At some stage the error on the validation set will start increasing again, because the network has stopped learning about the function that generated the data, and started to learn about the noise that is in the data itself. At this stage we stop the training. This technique is called early stopping. 39
When to stop?
Thus, the validation set is used to prevent overfitting and to monitor the generalization ability of the network: if the accuracy over the training set increases but the accuracy over the validation set stays the same or decreases, then we have caused overfitting and should stop training.
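The early-stopping rule described above can be sketched on a hypothetical validation-error curve (the numbers and the patience parameter are made up for illustration):

```python
def early_stopping(validation_errors, patience=2):
    """Return the epoch with the lowest validation error, stopping
    once the error has not improved for `patience` checks in a row."""
    best, best_epoch, bad_checks = float("inf"), 0, 0
    for epoch, err in enumerate(validation_errors):
        if err < best:
            best, best_epoch, bad_checks = err, epoch, 0
        else:
            bad_checks += 1
            if bad_checks >= patience:  # validation error keeps rising: stop
                break
    return best_epoch

# Hypothetical curve: error decreases, then rises again (overfitting).
errors = [0.50, 0.30, 0.20, 0.15, 0.18, 0.22, 0.30]
print(early_stopping(errors))  # 3: the epoch with the lowest validation error
```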
Perceptron Variant The pocket algorithm: Keeps the best solution seen so far "in its pocket". The algorithm then returns the solution in the pocket, rather than the last solution. 41
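A minimal sketch of the pocket idea, run on a non-separable (XOR-like) dataset chosen for illustration; the fixed pass order and epoch count are arbitrary choices:

```python
import numpy as np

def pocket_perceptron(X, y, epochs=20):
    """Perceptron that returns the best weights seen so far, not the last ones."""
    w = np.zeros(X.shape[1])
    pocket_w = w.copy()
    pocket_mistakes = int(np.sum(y * (X @ w) <= 0))
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            if y_i * np.dot(w, x_i) <= 0:
                w = w + y_i * x_i                       # usual Perceptron update
                mistakes = int(np.sum(y * (X @ w) <= 0))
                if mistakes < pocket_mistakes:          # better: put it in the pocket
                    pocket_w, pocket_mistakes = w.copy(), mistakes
    return pocket_w, pocket_mistakes

# XOR data (with a bias coordinate) is not linearly separable, so the plain
# Perceptron never converges; the pocket keeps the best hyperplane it visited.
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
y = np.array([-1.0, 1.0, 1.0, -1.0])
w, mistakes = pocket_perceptron(X, y)
print(mistakes)  # training mistakes of the pocket weights
```

On this run the final iterate of the plain Perceptron cycles back to worse weights, while the pocket returns the lowest-error weights encountered.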
Perceptron Bound Theorem
Assume there exists w* with ‖w*‖ = 1 such that y_i⟨w*, x_i⟩ ≥ γ for all i (the data is separable with margin γ > 0), and let R = max_i ‖x_i‖. Then the Perceptron makes at most (R/γ)² updates before it finds a separating hyperplane.
Note: γ is called a margin.