Linear Classification: Perceptron


Yufei Tao
Department of Computer Science and Engineering, Chinese University of Hong Kong

In this lecture, we will consider an important special version of the classification task called linear classification, defined as follows.

Definition 1. Let $\mathbb{R}^d$ denote the d-dimensional space where the domain of each dimension is the set $\mathbb{R}$ of real values. Let $P$ be a set of points in $\mathbb{R}^d$, each of which is colored either red or blue. The goal of the linear classification problem is to determine whether there is a d-dimensional plane

$x_1 c_1 + x_2 c_2 + \ldots + x_d c_d = \lambda$

which separates the red points from the blue points in $P$. In other words, all the red points must fall on the same side of the plane, while all the blue points must fall on the other side. No point is allowed to fall on the plane. If such a plane exists, then $P$ is said to be linearly separable. Otherwise, $P$ is linearly non-separable.

Example 2. [Figure: two 2d point sets, one linearly separable and one linearly non-separable.]

Think: Why do we call the linear classification problem a special case of the classification task? In particular, what is the model, and how do we perform classification on an unknown object?

Some reasons why linear classification is important:

The datasets in many real-world classification tasks are linearly separable. This is especially true when the objects of the same class form a cluster, so that the yes and no clusters are far apart.

Linear classification is a scientific problem that has a unique correct answer (i.e., whether the input set is linearly separable).

We will first look at a slightly different version of the problem:

Definition 3. Let $\mathbb{R}^d$ denote the d-dimensional space where the domain of each dimension is the set $\mathbb{R}$ of real values. Let $P$ be a set of points in $\mathbb{R}^d$, each of which is colored either red or blue. The goal is to determine whether there is a d-dimensional plane

$x_1 c_1 + x_2 c_2 + \ldots + x_d c_d = 0$

such that all the red points fall on the same side of the plane, while all the blue points fall on the other side. No point is allowed to fall on the plane.

Note that we have restricted the target plane to pass through the origin $(0, 0, \ldots, 0)$ of $\mathbb{R}^d$. It turns out that the original problem in Definition 1 can be converted into the problem in Definition 3. We will come back to this issue later.

We assume that no point in $P$ is at the origin. We further require that if a point $p = (x_1, x_2, \ldots, x_d)$ is red, then it must hold that

$x_1 c_1 + x_2 c_2 + \ldots + x_d c_d > 0.$

Otherwise (i.e., $p$ is blue), it must hold that

$x_1 c_1 + x_2 c_2 + \ldots + x_d c_d < 0.$

Think: Why are we allowed to make these requirements without worrying about which color should be assigned to the $> 0$ case?

Next, we will discuss an algorithm called perceptron. Strictly speaking, this algorithm does not fully settle the linear classification problem, because when the input set $P$ is linearly non-separable, the algorithm runs forever and never terminates. This shortcoming, however, does not prevent perceptron from being a classical method in machine learning. As we will see, the algorithm is surprisingly simple, and is guaranteed to find a separation plane on a linearly separable dataset.

To facilitate our presentation, let us introduce some (standard) definitions and notation on vectors:

Define a vector $v$ to be a sequence of $d$ real values $(v_1, v_2, \ldots, v_d)$. Given two vectors $v_1 = (a_1, a_2, \ldots, a_d)$ and $v_2 = (b_1, b_2, \ldots, b_d)$, define:

$v_1 \cdot v_2$ as the real value $\sum_{i=1}^{d} a_i b_i$.
$v_1 + v_2$ as the vector $(a_1 + b_1, a_2 + b_2, \ldots, a_d + b_d)$.
$v_1 - v_2$ as the vector $(a_1 - b_1, a_2 - b_2, \ldots, a_d - b_d)$.

Each point $p(x_1, \ldots, x_d)$ corresponds to a vector $p = (x_1, \ldots, x_d)$. Denote $c = (c_1, \ldots, c_d)$, where $c_1, \ldots, c_d$ are the coefficients of the plane in Definition 3. Hence, if $p$ is a red point, we require $p \cdot c > 0$; otherwise, we require $p \cdot c < 0$.
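These operations are easy to code up directly. Below is a minimal sketch in plain Python; the function names dot, add, and sub are our own, not from the lecture.

    # Component-wise vector operations used later by the perceptron sketch.
    def dot(v1, v2):
        """Dot product: the sum of a_i * b_i over all d dimensions."""
        return sum(a * b for a, b in zip(v1, v2))

    def add(v1, v2):
        """Vector sum (a_1 + b_1, ..., a_d + b_d)."""
        return tuple(a + b for a, b in zip(v1, v2))

    def sub(v1, v2):
        """Vector difference (a_1 - b_1, ..., a_d - b_d)."""
        return tuple(a - b for a, b in zip(v1, v2))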

Perceptron

The algorithm starts with $c = (0, 0, \ldots, 0)$, and then runs in iterations. In each iteration, it simply checks whether any point $p \in P$ violates our requirement according to the current $c$. If so, the algorithm adjusts $c$ as follows:

If $p$ is red, then $c \leftarrow c + p$.
If $p$ is blue, then $c \leftarrow c - p$.

As soon as $c$ has been adjusted, the current iteration finishes, and a new iteration starts. The algorithm terminates when an iteration finds all points of $P$ on the right side of the plane.
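The whole procedure fits in a few lines. The following is a minimal sketch (not the lecture's own code), reusing dot, add, and sub from the previous snippet and representing the input as (point, color) pairs.

    # A direct sketch of the perceptron loop described above. Input: a list of
    # (point, color) pairs, with color in {"red", "blue"} and points as tuples.
    # Warning: as noted in the lecture, this loops forever if the input is not
    # linearly separable by a plane through the origin.
    def perceptron(points):
        d = len(points[0][0])
        c = (0,) * d                          # start with the zero vector
        while True:
            violator = None
            for p, color in points:
                s = dot(p, c)
                # a red point must have p . c > 0, a blue point must have p . c < 0
                if (color == "red" and s <= 0) or (color == "blue" and s >= 0):
                    violator = (p, color)
                    break
            if violator is None:              # every point is on the right side
                return c
            p, color = violator
            c = add(c, p) if color == "red" else sub(c, p)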

Example 4. Suppose that $P$ has four 2d points: $p_1 = (1, 0)$, $p_2 = (0, 1)$, $p_3 = (0, -1)$, and $p_4 = (-1, 0)$. Points $p_1$ and $p_3$ are red, and the others are blue. The algorithm starts with $c = (0, 0)$.

Iteration 1: $p_1 \cdot c = 0$, violating our requirement $p_1 \cdot c > 0$. Hence, we update $c$ to $c + p_1 = (1, 0)$.
Iteration 2: $p_2 \cdot c = 0$, violating our requirement $p_2 \cdot c < 0$. Hence, we update $c$ to $c - p_2 = (1, 0) - (0, 1) = (1, -1)$.
Iteration 3: all points satisfy our requirements. Thus, the algorithm finishes with $c = (1, -1)$.
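For a quick sanity check, the sketch above reproduces this run (the order in which points are scanned affects the intermediate steps; here it matches the example):

    # Reproducing Example 4 with the perceptron sketch above.
    P = [((1, 0), "red"), ((0, 1), "blue"), ((0, -1), "red"), ((-1, 0), "blue")]
    print(perceptron(P))    # prints (1, -1), as in the example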

Next, we will prove an important theorem, for which purpose we introduce one more notion on a vector $v = (v_1, \ldots, v_d)$: the length of $v$, denoted as $\|v\|$, is defined to be $\sqrt{v \cdot v} = \sqrt{\sum_{i=1}^{d} v_i^2}$.

It is well known that the dot product has the following property (the Cauchy-Schwarz inequality): $|v_1 \cdot v_2| \le \|v_1\| \, \|v_2\|$ for any $v_1$ and $v_2$.

Theorem 5. Perceptron always terminates on a linearly separable dataset $P$.

Proof. Since $P$ is linearly separable, there is a $c$ such that for all the red points $p_r \in P$, we have $p_r \cdot c > 0$, while for all the blue points $p_b \in P$, we have $p_b \cdot c < 0$. Furthermore, we can assume that $\|c\| = 1$ (think: why?). Next, we will use $u$ to denote this vector $c$ (because we will use the notation $c$ for other purposes).
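As a small numeric illustration of these two facts (the vectors below are arbitrary; length reuses dot from the earlier sketch):

    import math
    # Length and the Cauchy-Schwarz property, checked on two arbitrary vectors.
    def length(v):
        return math.sqrt(dot(v, v))

    v1, v2 = (3.0, -1.0, 2.0), (1.0, 4.0, -2.0)
    print(length(v1))                                   # sqrt(14), about 3.742
    print(abs(dot(v1, v2)) <= length(v1) * length(v2))  # True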

Proof (cont.). Define:

$\gamma = \min_{p \in P} |p \cdot u|.$

Note that $\gamma > 0$. Also define:

$R = \max_{p \in P} \|p\|.$

Recall that the perceptron algorithm adjusts $c$ in each iteration. Let $c_i$ ($i \ge 1$) be the $c$ after the $i$-th iteration. Also, let $c_0 = (0, \ldots, 0)$ be the initial $c$ before the first iteration, and let $k$ be the total number of iterations.

Proof (cont.). We will first prove that, for any $i \ge 0$, $c_{i+1} \cdot u \ge c_i \cdot u + \gamma$.

Recall that $c_{i+1}$ was adjusted from $c_i$ in one of the following cases:

A red point $p_r$ violates our requirement, namely, $p_r \cdot c_i \le 0$. In this case, $c_{i+1} = c_i + p_r$; and hence, $c_{i+1} \cdot u = c_i \cdot u + p_r \cdot u$. From the definition of $\gamma$, we know that $p_r \cdot u \ge \gamma$. Therefore, $c_{i+1} \cdot u \ge c_i \cdot u + \gamma$.

A blue point $p_b$ violates our requirement, namely, $p_b \cdot c_i \ge 0$. In this case, $c_{i+1} = c_i - p_b$; and hence, $c_{i+1} \cdot u = c_i \cdot u - p_b \cdot u$. From the definition of $\gamma$, we know that $p_b \cdot u \le -\gamma$. Therefore, $c_{i+1} \cdot u \ge c_i \cdot u + \gamma$.

It follows that

$c_k \cdot u \;\ge\; c_{k-1} \cdot u + \gamma \;\ge\; c_{k-2} \cdot u + 2\gamma \;\ge\; \ldots \;\ge\; c_0 \cdot u + k\gamma \;=\; k\gamma. \quad (1)$

Proof (cont.). We will now prove that, for any $i \ge 0$, $\|c_{i+1}\|^2 \le \|c_i\|^2 + R^2$.

Recall that $c_{i+1}$ was adjusted from $c_i$ in one of the following cases:

A red point $p_r$ violates our requirement, namely, $p_r \cdot c_i \le 0$. In this case, $c_{i+1} = c_i + p_r$. Thus:

$\|c_{i+1}\|^2 = c_{i+1} \cdot c_{i+1} = (c_i + p_r) \cdot (c_i + p_r) = c_i \cdot c_i + 2\, c_i \cdot p_r + p_r \cdot p_r \;\le\; \|c_i\|^2 + 2\, c_i \cdot p_r + R^2 \;\le\; \|c_i\|^2 + R^2$

where the second-to-last step used the definition of $R$, and the last step used the fact that $p_r \cdot c_i \le 0$.

A blue point $p_b$ violates our requirement, namely, $p_b \cdot c_i \ge 0$. The proof is similar, and omitted (a good exercise for you).

It follows that

$\|c_k\|^2 \;\le\; \|c_{k-1}\|^2 + R^2 \;\le\; \|c_{k-2}\|^2 + 2R^2 \;\le\; \ldots \;\le\; \|c_0\|^2 + kR^2 \;=\; kR^2. \quad (2)$

Proof (cont.). Now we combine (1) and (2) to obtain an upper bound on $k$. From (1), we know that

$\|c_k\| = \|c_k\| \, \|u\| \;\ge\; c_k \cdot u \;\ge\; k\gamma.$

Therefore, $\|c_k\|^2 \ge k^2 \gamma^2$. Comparing this to (2) gives:

$k R^2 \;\ge\; k^2 \gamma^2 \;\Rightarrow\; k \;\le\; \frac{R^2}{\gamma^2}.$

Hence, perceptron terminates after at most $R^2 / \gamma^2$ iterations.
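To make the bound concrete, here is a quick computation on the dataset of Example 4, taking $u = (1/\sqrt{2}, -1/\sqrt{2})$ as one valid unit-length separator through the origin (our own choice for illustration; any unit separator works). It reuses dot from the earlier sketch.

    import math
    # Computing gamma, R, and the iteration bound R^2 / gamma^2 for Example 4.
    P = [((1, 0), "red"), ((0, 1), "blue"), ((0, -1), "red"), ((-1, 0), "blue")]
    u = (1 / math.sqrt(2), -1 / math.sqrt(2))   # a unit separator we picked

    gamma = min(abs(dot(p, u)) for p, _ in P)   # = 1/sqrt(2)
    R = max(math.sqrt(dot(p, p)) for p, _ in P) # = 1
    print(R**2 / gamma**2)                      # about 2; Example 4 made 2 adjustments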

There is only one issue remaining. Recall that our original goal was to solve the problem in Definition 1. We instead turned to the problem in Definition 3, and claimed that this is sufficient. Next, we will establish this claim.

Let $P$ be a d-dimensional dataset which is the input to Definition 1. We create another dataset $P'$ of dimensionality $d + 1$ for Definition 3 as follows. For each point $p(x_1, \ldots, x_d)$ in $P$, create a point $p'(x_1, \ldots, x_d, 1)$ in $P'$; namely, we add one more dimension on which the coordinates of all points of $P'$ are fixed to 1. The color of $p'$ is the same as that of $p$.
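The conversion itself is mechanical. Below is a sketch, with a hypothetical helper named lift, reusing the perceptron sketch from earlier:

    # Lift each d-dimensional point to d+1 dimensions by appending a coordinate
    # fixed to 1, then solve the Definition 3 problem on the lifted set P'.
    def lift(points):
        return [(p + (1,), color) for p, color in points]

    # Usage: c = perceptron(lift(P)) yields coefficients (c_1, ..., c_d, c_{d+1});
    # the plane x_1*c_1 + ... + x_d*c_d + c_{d+1} = 0 then separates P as in
    # Definition 1 (with lambda = -c_{d+1}).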

Now we prove that the problem of Definition 1 can be converted to that of Definition 3.

Lemma 6. $P$ is linearly separable if and only if $P'$ is linearly separable.

Proof. Direction "If". Suppose that $x_1 c_1 + x_2 c_2 + \ldots + x_d c_d + x_{d+1} c_{d+1} = 0$ is a separation plane in Definition 3. This means that for every red point $p'(x_1, x_2, \ldots, x_d, 1)$ in $P'$, it holds that $x_1 c_1 + x_2 c_2 + \ldots + x_d c_d + 1 \cdot c_{d+1} > 0$. Also, for every blue point $p'(x_1, x_2, \ldots, x_d, 1)$ in $P'$, it holds that $x_1 c_1 + x_2 c_2 + \ldots + x_d c_d + 1 \cdot c_{d+1} < 0$. This means that $x_1 c_1 + x_2 c_2 + \ldots + x_d c_d + c_{d+1} = 0$ is a separation plane for $P$ in Definition 1.

Proof (cont.). Direction "Only-If". Suppose that $x_1 c_1 + x_2 c_2 + \ldots + x_d c_d + c_{d+1} = 0$ is a separation plane for $P$ in Definition 1. This means that for every red point $p(x_1, x_2, \ldots, x_d)$ in $P$, it holds that $x_1 c_1 + x_2 c_2 + \ldots + x_d c_d + c_{d+1} > 0$. Also, for every blue point $p(x_1, x_2, \ldots, x_d)$ in $P$, it holds that $x_1 c_1 + x_2 c_2 + \ldots + x_d c_d + c_{d+1} < 0$. This means that $x_1 c_1 + x_2 c_2 + \ldots + x_d c_d + x_{d+1} c_{d+1} = 0$ is a separation plane for $P'$ in Definition 3.