The Perceptron Algorithm
Greg Grudic
Machine Learning
Questions?
Binary Classification

A binary classifier is a mapping from a set of $d$ inputs to a single output which can take on one of TWO values.

In the most general setting:
- inputs: $x \in \mathbb{R}^d$
- output: $y \in \{-1, +1\}$

Specifying the output classes as $-1$ and $+1$ is arbitrary! It is often done as a mathematical convenience.
A Binary Classifier

Given learning data: $(x_1, y_1), \ldots, (x_N, y_N)$

A model is constructed:
$$x \;\rightarrow\; \text{Classification Model } M(x) \;\rightarrow\; \hat{y} \in \{-1, +1\}$$
Linear Separating Hyper-Planes

[Figure: a hyperplane in the $(x_1, x_2)$ plane separating the two classes]
- On the boundary: $\beta_0 + \sum_{i=1}^{d} \beta_i x_i = 0$
- On the $y = +1$ side: $\beta_0 + \sum_{i=1}^{d} \beta_i x_i > 0$
- On the $y = -1$ side: $\beta_0 + \sum_{i=1}^{d} \beta_i x_i < 0$
Linear Separating Hyper-Planes

The Model:
$$\hat{y} = M(x) = \mathrm{sgn}\left( \hat\beta_0 + (\hat\beta_1, \ldots, \hat\beta_d)^T x \right)$$

Where:
$$\mathrm{sgn}[A] = \begin{cases} +1 & \text{if } A > 0 \\ -1 & \text{otherwise} \end{cases}$$

The decision boundary:
$$\hat\beta_0 + (\hat\beta_1, \ldots, \hat\beta_d)^T x = 0$$
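In code, the model is a single dot product and a sign. A minimal NumPy sketch (function and variable names are mine, not from the slides):

```python
import numpy as np

def predict(beta0, beta, x):
    """Linear classifier: sgn(beta0 + beta^T x), returning +1 or -1."""
    return 1 if beta0 + np.dot(beta, x) > 0 else -1
```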
Linear Separating Hyper-Planes

The model parameters are: $(\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_d)$

The hat on the betas means that they are estimated from the data.

Many different learning algorithms have been proposed for determining $(\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_d)$.
Rosenblatt's Perceptron Learning Algorithm

Dates back to the 1950s and is the motivation behind Neural Networks.

The algorithm:
- Start with a random hyperplane $(\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_d)$
- Incrementally modify the hyperplane such that points that are misclassified move closer to the correct side of the boundary
- Stop when all learning examples are correctly classified
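A minimal sketch of this loop in Python/NumPy (names are mine; it assumes inputs as rows of X, labels y in {-1, +1}, and the learning rate ρ introduced on a later slide):

```python
import numpy as np

def perceptron_train(X, y, rho=1.0, max_epochs=1000):
    """Rosenblatt's perceptron: start from a random hyperplane and nudge it
    toward each misclassified point until all points are classified correctly.
    X: (N, d) inputs; y: (N,) labels in {-1, +1}."""
    N, d = X.shape
    rng = np.random.default_rng(0)
    beta0 = rng.normal()            # random initial hyperplane
    beta = rng.normal(size=d)
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(N):
            if y[i] * (beta0 + X[i] @ beta) <= 0:   # misclassified
                beta0 += rho * y[i]                 # push boundary toward point
                beta += rho * y[i] * X[i]
                mistakes += 1
        if mistakes == 0:           # all examples correctly classified
            break
    return beta0, beta              # may not converge if data is not separable
```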
Rosenblatt's Perceptron Learning Algorithm

The algorithm is based on the following property: the signed distance of any point $x$ to the boundary is
$$D(x) = \frac{\hat\beta_0 + (\hat\beta_1, \ldots, \hat\beta_d)^T x}{\sqrt{\sum_{i=1}^{d} \hat\beta_i^2}}$$

Therefore, if $M$ is the set of misclassified learning examples, we can push them closer to the boundary by minimizing the following:
$$D(\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_d) = -\sum_{i \in M} y_i \left( \hat\beta_0 + (\hat\beta_1, \ldots, \hat\beta_d)^T x_i \right)$$
Rosenblatt's Minimization Function

This is classic Machine Learning!

First define a cost (risk/loss) function in model parameter space:
$$D(\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_d) = -\sum_{i \in M} y_i \left( \hat\beta_0 + \sum_{k=1}^{d} \hat\beta_k x_{ik} \right)$$

Then find an algorithm that modifies $(\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_d)$ such that this cost function is minimized. One such algorithm is Gradient Descent.
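A small sketch of this criterion in NumPy (my naming, assuming labels in {-1, +1}); it sums the margins of the misclassified points, so it is zero exactly when everything is correctly classified:

```python
import numpy as np

def perceptron_cost(beta0, beta, X, y):
    """Perceptron criterion: -sum over misclassified i of y_i*(beta0 + beta^T x_i)."""
    margins = y * (beta0 + X @ beta)
    return -np.sum(margins[margins <= 0])   # >= 0; zero iff no mistakes
```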
Gradient Descent

[Figure: the cost surface $D(\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_d)$ plotted over the $(\hat\beta_0, \hat\beta_1)$ parameter plane; gradient descent steps downhill toward the minimum]
The Gradient Descent Algorithm

$$\hat\beta_i \leftarrow \hat\beta_i - \rho \, \frac{\partial D(\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_d)}{\partial \hat\beta_i}$$

Where the learning rate is defined by: $\rho > 0$
The Gradient Descent Algorithm for the Perceptron

The partial derivatives of the cost are:
$$\frac{\partial D(\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_d)}{\partial \hat\beta_0} = -\sum_{i \in M} y_i, \qquad \frac{\partial D(\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_d)}{\partial \hat\beta_j} = -\sum_{i \in M} y_i x_{ij}, \quad j = 1, \ldots, d$$

Two versions of the Perceptron Algorithm (a code sketch follows):

Update one misclassified point $i$ at a time (online):
$$\begin{pmatrix} \hat\beta_0 \\ \hat\beta_1 \\ \vdots \\ \hat\beta_d \end{pmatrix} \leftarrow \begin{pmatrix} \hat\beta_0 \\ \hat\beta_1 \\ \vdots \\ \hat\beta_d \end{pmatrix} + \rho \begin{pmatrix} y_i \\ y_i x_{i1} \\ \vdots \\ y_i x_{id} \end{pmatrix}$$

Update all misclassified points at once (batch):
$$\begin{pmatrix} \hat\beta_0 \\ \hat\beta_1 \\ \vdots \\ \hat\beta_d \end{pmatrix} \leftarrow \begin{pmatrix} \hat\beta_0 \\ \hat\beta_1 \\ \vdots \\ \hat\beta_d \end{pmatrix} + \rho \sum_{i \in M} \begin{pmatrix} y_i \\ y_i x_{i1} \\ \vdots \\ y_i x_{id} \end{pmatrix}$$
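The online version is exactly the inner loop of the training sketch shown earlier; a minimal sketch of one batch step (my naming) is:

```python
import numpy as np

def perceptron_batch_step(beta0, beta, X, y, rho=1.0):
    """One batch gradient step: sum the update over ALL currently
    misclassified points, then move the hyperplane once."""
    mis = y * (beta0 + X @ beta) <= 0        # indicator of the set M
    beta0 += rho * np.sum(y[mis])
    beta += rho * (y[mis] @ X[mis])          # sum of y_i * x_i over M
    return beta0, beta
```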
The Learning Data

Training Data: $(x_1, y_1), \ldots, (x_N, y_N)$

Matrix representation of $N$ learning examples of $d$-dimensional inputs:
$$X = \begin{pmatrix} x_{11} & \cdots & x_{1d} \\ \vdots & \ddots & \vdots \\ x_{N1} & \cdots & x_{Nd} \end{pmatrix}, \qquad Y = \begin{pmatrix} y_1 \\ \vdots \\ y_N \end{pmatrix}$$
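For concreteness, a toy usage of the earlier training sketch with data in this layout (the points and labels are invented for illustration):

```python
import numpy as np

# Four 2-D points with OR-like labels in {-1, +1}; linearly separable
X = np.array([[0.0, 0.0],
              [0.0, 1.0],
              [1.0, 0.0],
              [1.0, 1.0]])
y = np.array([-1, 1, 1, 1])

beta0, beta = perceptron_train(X, y)
print(np.sign(beta0 + X @ beta))   # should match y after convergence
```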
The Good Theoretical Properties of the Perceptron Algorithm

If a solution exists, the algorithm will always converge in a finite number of steps!

Question: Does a solution always exist?
Linearly Separable Data

Which of these datasets are separable by a linear boundary?

[Figure: two scatter plots of + and - points, labeled (a) and (b)]
Linearly Separable Data

Which of these datasets are separable by a linear boundary?

[Figure: the same two scatter plots, with dataset (b) marked "Not Linearly Separable!"]
Bad Theoretical Properties of the Perceptron Algorithm

If the data is not linearly separable, the algorithm cycles forever and cannot converge! This property stopped active research in this area between 1968 and 1984 (Perceptrons, Minsky and Papert, 1969).

Even when the data is separable, there are infinitely many solutions. Which solution is best?

When data is linearly separable, the number of steps to converge can be very large (it depends on the size of the gap between classes).
What about Nonlinear Data?

Data that is not linearly separable is called nonlinear data.

Nonlinear data can often be mapped into a nonlinear space where it is linearly separable.
Nonlinear Models

The Linear Model:
$$\hat{y} = M(x) = \mathrm{sgn}\left( \hat\beta_0 + \sum_{i=1}^{d} \hat\beta_i x_i \right)$$

The Nonlinear (basis function) Model:
$$\hat{y} = M(x) = \mathrm{sgn}\left( \hat\beta_0 + \sum_{i=1}^{k} \hat\beta_i \phi_i(x) \right)$$

Examples of Nonlinear Basis Functions:
$$\phi_1(x) = x_2^2, \quad \phi_2(x) = x_2^3, \quad \phi_3(x) = x_4 x_{55}, \quad \phi_k(x) = \sin(x_d)$$
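A sketch of the idea in NumPy (the particular features are my choice for illustration): expand each input through φ, then run the same linear perceptron on the expanded features.

```python
import numpy as np

def phi(x):
    """Illustrative basis expansion of a 2-D input into 5 features."""
    x1, x2 = x
    return np.array([x1, x2, x1**2, x2**2, x1 * x2])

# The linear perceptron is then trained on rows of phi(x) instead of x:
# beta0, beta = perceptron_train(np.array([phi(x) for x in X]), y)
```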
Linear Separating Hyper-Planes In Nonlinear Basis Function Space

[Figure: a hyperplane in the $(\phi_1, \phi_2)$ plane separating the two classes]
- On the boundary: $\beta_0 + \sum_{i=1}^{k} \beta_i \phi_i = 0$
- On the $y = +1$ side: $\beta_0 + \sum_{i=1}^{k} \beta_i \phi_i > 0$
- On the $y = -1$ side: $\beta_0 + \sum_{i=1}^{k} \beta_i \phi_i < 0$
An Example

[Figure: left panel shows + and - points in $(x_1, x_2)$ that are not linearly separable; applying the mapping $\Phi$ with $\phi_1 = x_1^2$, $\phi_2 = x_2^2$ gives a right panel in $(\phi_1, \phi_2)$ where the classes are linearly separable]
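A small sketch reproducing this kind of picture (synthetic circle data of my own construction): points inside a circle are not linearly separable from points outside it, but become so after squaring the coordinates.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.where(X[:, 0]**2 + X[:, 1]**2 < 0.5, 1, -1)  # circular boundary

Phi = X**2   # phi_1 = x_1^2, phi_2 = x_2^2
# In (phi_1, phi_2) space the boundary phi_1 + phi_2 = 0.5 is a straight line.
```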
Kernels as Nonlinear Transformations

Polynomial:
$$K(x_i, x_j) = \left( \langle x_i, x_j \rangle + q \right)^k$$

Sigmoid:
$$K(x_i, x_j) = \tanh\left( \kappa \langle x_i, x_j \rangle + \theta \right)$$

Gaussian or Radial Basis Function (RBF):
$$K(x_i, x_j) = \exp\left( -\frac{\| x_i - x_j \|^2}{2\sigma^2} \right)$$
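These three kernels transcribe directly into NumPy (the default parameter values are mine, chosen only so the functions run as written):

```python
import numpy as np

def poly_kernel(xi, xj, q=1.0, k=2):
    return (xi @ xj + q) ** k

def sigmoid_kernel(xi, xj, kappa=1.0, theta=0.0):
    return np.tanh(kappa * (xi @ xj) + theta)

def rbf_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / (2 * sigma ** 2))
```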
The Kernel Model

Training Data: $(x_1, y_1), \ldots, (x_N, y_N)$

$$\hat{y} = M(x) = \mathrm{sgn}\left( \hat\beta_0 + \sum_{i=1}^{N} \hat\beta_i K(x_i, x) \right)$$

The number of basis functions equals the number of training examples, unless some of the betas get set to zero!
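A sketch of evaluating this model (my naming; it assumes some kernel function K and already-fitted betas):

```python
import numpy as np

def kernel_predict(beta0, betas, X_train, x, K):
    """Kernel model: sgn(beta0 + sum_i beta_i * K(x_i, x)).
    X_train: (N, d) training inputs; betas: (N,) coefficients."""
    s = beta0 + sum(b * K(xi, x) for b, xi in zip(betas, X_train))
    return 1 if s > 0 else -1
```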
The Gram (Kernel) Matrix

Training Data: $(x_1, y_1), \ldots, (x_N, y_N)$

$$K = \begin{pmatrix} K(x_1, x_1) & \cdots & K(x_1, x_N) \\ \vdots & \ddots & \vdots \\ K(x_N, x_1) & \cdots & K(x_N, x_N) \end{pmatrix}$$

Properties:
- Positive definite matrix
- Symmetric
- Positive on diagonal
- $N$ by $N$
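Building the Gram matrix, reusing the kernel sketches from the previous slide:

```python
import numpy as np

def gram_matrix(X, K):
    """N-by-N matrix of pairwise kernel values K(x_i, x_j)."""
    N = X.shape[0]
    G = np.empty((N, N))
    for i in range(N):
        for j in range(N):
            G[i, j] = K(X[i], X[j])
    return G

# e.g. G = gram_matrix(X, rbf_kernel); G is symmetric with a positive diagonal
```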
Picking a Model Structure?

How do you pick the kernels and kernel parameters? These are called learning parameters or hyperparameters.

Two approaches to choosing learning parameters:
- Bayesian: learning parameters must maximize the probability of correct classification on future data, based on prior biases.
- Frequentist: use the training data to learn the model parameters $(\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_d)$, and use validation data to pick the best hyperparameters.
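A minimal sketch of the frequentist recipe (entirely my code): hold out a validation split, train once per candidate hyperparameter value, and keep the value with the best validation accuracy. Here `train` and `accuracy` are hypothetical stand-ins for whichever model and metric are being tuned.

```python
import numpy as np

def pick_sigma(X_tr, y_tr, X_val, y_val, candidates, train, accuracy):
    """Grid search over candidate hyperparameter values on a validation set."""
    best_sigma, best_acc = None, -np.inf
    for sigma in candidates:
        model = train(X_tr, y_tr, sigma)       # fit model parameters on training data
        acc = accuracy(model, X_val, y_val)    # score on held-out validation data
        if acc > best_acc:
            best_sigma, best_acc = sigma, acc
    return best_sigma
```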
Perceptron Algorithm Convergence

Two problems:
- No convergence when data is not separable in basis function space
- Infinitely many solutions when data is separable

Can we modify the algorithm to fix these problems?