Linear Classification: Perceptron
Yufei Tao
Department of Computer Science and Engineering
Chinese University of Hong Kong
In this lecture, we will consider an important special version of the classification task called linear classification, defined as follows.

Definition 1. Let $\mathbb{R}^d$ denote the $d$-dimensional space where the domain of each dimension is the set $\mathbb{R}$ of real values. Let $P$ be a set of points in $\mathbb{R}^d$, each of which is colored either red or blue. The goal of the linear classification problem is to determine whether there is a plane
$$x_1 c_1 + x_2 c_2 + \cdots + x_d c_d = \lambda$$
which separates the red points from the blue points in $P$. In other words, all the red points must fall on one side of the plane, while all the blue points must fall on the other side. No point is allowed to fall on the plane. If such a plane exists, then $P$ is said to be linearly separable. Otherwise, $P$ is linearly non-separable.
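To make the definition concrete, below is a minimal Python sketch (not part of the original slides; all names are illustrative) that checks whether one given candidate plane, specified by coefficients c and threshold lam, separates a colored point set in the sense of Definition 1.

```python
import numpy as np

def separates(points, colors, c, lam):
    """Check whether the plane c . x = lam separates red from blue.

    points: (n, d) array; colors: length-n list of 'red'/'blue';
    c: length-d coefficient vector; lam: the threshold lambda.
    """
    side = points @ np.asarray(c) - lam   # signed side of each point
    if np.any(side == 0):                 # no point may lie on the plane
        return False
    is_red = np.array([col == 'red' for col in colors])
    red, blue = side[is_red], side[~is_red]
    # all reds strictly on one side, all blues strictly on the other
    return (np.all(red > 0) and np.all(blue < 0)) or \
           (np.all(red < 0) and np.all(blue > 0))
```

Note that this only verifies a given plane; deciding whether any separating plane exists is the actual problem, addressed by the perceptron algorithm below.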
Example 2. [Figures omitted: one point set that is linearly separable, and one that is linearly non-separable.]
Think: Why do we call the linear classification problem a special case of the classification task? In particular, what is the model, and how do we perform classification on an unknown object?

Some reasons why linear classification is important:
- The datasets in many real-world classification tasks are linearly separable. This is especially true when the objects of the same class form a cluster such that the yes and no clusters are far apart.
- Linear classification is a scientific problem with a unique correct answer (i.e., whether the input set is linearly separable).
We will first look at a slightly different version of the problem:

Definition 3. Let $\mathbb{R}^d$ denote the $d$-dimensional space where the domain of each dimension is the set $\mathbb{R}$ of real values. Let $P$ be a set of points in $\mathbb{R}^d$, each of which is colored either red or blue. The goal is to determine whether there is a plane
$$x_1 c_1 + x_2 c_2 + \cdots + x_d c_d = 0$$
such that all the red points fall on one side of the plane, while all the blue points fall on the other side. No point is allowed to fall on the plane.

Note that we have restricted the target plane to pass through the origin $(0, 0, \ldots, 0)$ of $\mathbb{R}^d$. It turns out that the original problem in Definition 1 can be converted into the problem in Definition 3. We will come back to this issue later.
We assume that no point in $P$ is at the origin. We further require that if a point $p = (x_1, x_2, \ldots, x_d)$ is red, then it must hold that
$$x_1 c_1 + x_2 c_2 + \cdots + x_d c_d > 0.$$
Otherwise (i.e., $p$ is blue), it must hold that
$$x_1 c_1 + x_2 c_2 + \cdots + x_d c_d < 0.$$

Think: Why are we allowed to make these requirements without worrying about which color should be assigned to the $> 0$ case?
Next, we will discuss an algorithm called perceptron. Strictly speaking, this algorithm does not fully settle the linear classification problem: when the input set $P$ is linearly non-separable, the algorithm runs forever and never terminates. However, this shortcoming does not prevent perceptron from being a classical method in machine learning. As we will see, the algorithm is surprisingly simple, and is guaranteed to find a separation plane on a linearly separable dataset.
To facilitate our presentation, let us introduce some (standard) definitions and notations on vectors:
- Define a vector $v$ to be a sequence of $d$ real values $(v_1, v_2, \ldots, v_d)$.
- Given two vectors $v_1 = (a_1, a_2, \ldots, a_d)$ and $v_2 = (b_1, b_2, \ldots, b_d)$, define:
  - $v_1 \cdot v_2$ as the real value $\sum_{i=1}^{d} a_i b_i$;
  - $v_1 + v_2$ as the vector $(a_1 + b_1, a_2 + b_2, \ldots, a_d + b_d)$;
  - $v_1 - v_2$ as the vector $(a_1 - b_1, a_2 - b_2, \ldots, a_d - b_d)$.
- Each point $p = (x_1, \ldots, x_d)$ corresponds to a vector $\vec{p} = (x_1, \ldots, x_d)$.
- Denote $\vec{c} = (c_1, \ldots, c_d)$, where $c_1, \ldots, c_d$ are the coefficients of the plane in Definition 3. Hence, if $p$ is a red point, we require $\vec{p} \cdot \vec{c} > 0$; otherwise, we require $\vec{p} \cdot \vec{c} < 0$.
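In code, these operations map directly onto numpy arrays; the tiny sketch below (illustrative, not from the lecture) mirrors the notation above and is reused by the perceptron sketch later.

```python
import numpy as np

v1 = np.array([1.0, 2.0, 3.0])
v2 = np.array([4.0, 5.0, 6.0])

dot  = v1 @ v2    # v1 . v2 = sum_i a_i * b_i  -> 32.0
s    = v1 + v2    # component-wise sum         -> [5. 7. 9.]
diff = v1 - v2    # component-wise difference  -> [-3. -3. -3.]
```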
Perceptron. The algorithm starts with $\vec{c} = (0, 0, \ldots, 0)$, and then runs in iterations. In each iteration, it simply checks whether any point $p \in P$ violates our requirement according to $\vec{c}$. If so, the algorithm adjusts $\vec{c}$ as follows:
- If $p$ is red, then $\vec{c} \leftarrow \vec{c} + \vec{p}$.
- If $p$ is blue, then $\vec{c} \leftarrow \vec{c} - \vec{p}$.
As soon as $\vec{c}$ has been adjusted, the current iteration finishes, and a new iteration starts. The algorithm finishes when an iteration finds all points of $P$ on the right side of the plane.
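The following Python sketch implements the algorithm exactly as stated. One assumption, made for compactness and not in the slides: colors are encoded as labels $y = +1$ (red) and $y = -1$ (blue), under which both update rules collapse into $\vec{c} \leftarrow \vec{c} + y\,\vec{p}$.

```python
import numpy as np

def perceptron(points, labels):
    """Perceptron for the origin-crossing plane of Definition 3.

    points: (n, d) array of points; labels: +1 for red, -1 for blue.
    Returns c such that y * (p . c) > 0 for every point, i.e., red
    points satisfy p . c > 0 and blue points satisfy p . c < 0.
    Warning: loops forever if the input is not linearly separable.
    """
    n, d = points.shape
    c = np.zeros(d)                      # start with the zero vector
    while True:
        for p, y in zip(points, labels):
            if y * (p @ c) <= 0:         # requirement violated (or on plane)
                c = c + y * p            # +p for red, -p for blue
                break                    # one adjustment per iteration
        else:
            return c                     # no violation: c separates P
```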
Example 4. Suppose that $P$ has four 2d points: $p_1 = (1, 0)$, $p_2 = (0, 1)$, $p_3 = (0, -1)$, and $p_4 = (-1, 0)$. Points $p_1$ and $p_3$ are red, and the other two are blue. The algorithm starts with $\vec{c} = (0, 0)$.
- Iteration 1: $\vec{p}_1 \cdot \vec{c} = 0$, violating our requirement $\vec{p}_1 \cdot \vec{c} > 0$. Hence, we update $\vec{c}$ to $\vec{c} + \vec{p}_1 = (1, 0)$.
- Iteration 2: $\vec{p}_2 \cdot \vec{c} = 0$, violating our requirement $\vec{p}_2 \cdot \vec{c} < 0$. Hence, we update $\vec{c}$ to $\vec{c} - \vec{p}_2 = (1, 0) - (0, 1) = (1, -1)$.
- Iteration 3: all points satisfy our requirements. Thus, the algorithm finishes with $\vec{c} = (1, -1)$.
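Running the earlier perceptron sketch on this dataset reproduces the trace (assuming the same point order, since the returned plane depends on the order in which violations are found):

```python
points = np.array([[1, 0], [0, 1], [0, -1], [-1, 0]], dtype=float)
labels = np.array([+1, -1, +1, -1])   # p1, p3 red; p2, p4 blue
print(perceptron(points, labels))     # -> [ 1. -1.]
```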
Next, we will prove an important theorem, for which purpose we introduce one more notion on a vector $v = (v_1, \ldots, v_d)$: the length of $v$, denoted $\|v\|$, is defined to be $\sqrt{v \cdot v} = \sqrt{\sum_{i=1}^{d} v_i^2}$. It is well known that the dot product has the following property (the Cauchy–Schwarz inequality): $v_1 \cdot v_2 \le \|v_1\| \cdot \|v_2\|$ for any $v_1$ and $v_2$.

Theorem 5. Perceptron always terminates on a linearly separable dataset $P$.

Proof. Since $P$ is linearly separable, there is a $\vec{c}$ such that for all the red points $p_r \in P$, we have $\vec{p}_r \cdot \vec{c} > 0$, while for all the blue points $p_b \in P$, we have $\vec{p}_b \cdot \vec{c} < 0$. Furthermore, we can assume that $\|\vec{c}\| = 1$ (think: why?). From now on, we will use $\vec{u}$ to denote this vector $\vec{c}$ (because we will use the notation $\vec{c}$ for other purposes).
Proof (cont.). Define:
$$\gamma = \min_{p \in P} |\vec{p} \cdot \vec{u}|.$$
Note that $\gamma > 0$. Also define:
$$R = \max_{p \in P} \|\vec{p}\|.$$
Recall that the perceptron algorithm adjusts $\vec{c}$ in each iteration. Let $\vec{c}_i$ ($i \ge 1$) be the $\vec{c}$ after the $i$-th iteration, and let $\vec{c}_0 = (0, \ldots, 0)$ be the initial $\vec{c}$ before the first iteration. Also, let $k$ be the total number of iterations.
Proof (cont.). We will first prove that, for any $i \ge 0$,
$$\vec{c}_{i+1} \cdot \vec{u} \;\ge\; \vec{c}_i \cdot \vec{u} + \gamma.$$
Recall that $\vec{c}_{i+1}$ was adjusted from $\vec{c}_i$ in one of the following cases:
- A red point $p_r$ violates our requirement, namely, $\vec{p}_r \cdot \vec{c}_i \le 0$. In this case, $\vec{c}_{i+1} = \vec{c}_i + \vec{p}_r$; hence, $\vec{c}_{i+1} \cdot \vec{u} = \vec{c}_i \cdot \vec{u} + \vec{p}_r \cdot \vec{u}$. From the definition of $\gamma$, we know that $\vec{p}_r \cdot \vec{u} \ge \gamma$. Therefore, $\vec{c}_{i+1} \cdot \vec{u} \ge \vec{c}_i \cdot \vec{u} + \gamma$.
- A blue point $p_b$ violates our requirement, namely, $\vec{p}_b \cdot \vec{c}_i \ge 0$. In this case, $\vec{c}_{i+1} = \vec{c}_i - \vec{p}_b$; hence, $\vec{c}_{i+1} \cdot \vec{u} = \vec{c}_i \cdot \vec{u} - \vec{p}_b \cdot \vec{u}$. From the definition of $\gamma$, we know that $\vec{p}_b \cdot \vec{u} \le -\gamma$. Therefore, $\vec{c}_{i+1} \cdot \vec{u} \ge \vec{c}_i \cdot \vec{u} + \gamma$.

It follows that
$$\vec{c}_k \cdot \vec{u} \;\ge\; \vec{c}_{k-1} \cdot \vec{u} + \gamma \;\ge\; \vec{c}_{k-2} \cdot \vec{u} + 2\gamma \;\ge\; \cdots \;\ge\; \vec{c}_0 \cdot \vec{u} + k\gamma \;=\; k\gamma. \quad (1)$$
Proof (cont.). We will now prove that, for any $i \ge 0$,
$$\|\vec{c}_{i+1}\|^2 \;\le\; \|\vec{c}_i\|^2 + R^2.$$
Recall that $\vec{c}_{i+1}$ was adjusted from $\vec{c}_i$ in one of the following cases:
- A red point $p_r$ violates our requirement, namely, $\vec{p}_r \cdot \vec{c}_i \le 0$. In this case, $\vec{c}_{i+1} = \vec{c}_i + \vec{p}_r$. Thus:
$$\|\vec{c}_{i+1}\|^2 = \vec{c}_{i+1} \cdot \vec{c}_{i+1} = (\vec{c}_i + \vec{p}_r) \cdot (\vec{c}_i + \vec{p}_r) = \vec{c}_i \cdot \vec{c}_i + 2\,\vec{c}_i \cdot \vec{p}_r + \vec{p}_r \cdot \vec{p}_r \;\le\; \|\vec{c}_i\|^2 + 2\,\vec{c}_i \cdot \vec{p}_r + R^2 \;\le\; \|\vec{c}_i\|^2 + R^2,$$
where the first inequality is by the definition of $R$, and the last step used the fact that $\vec{p}_r \cdot \vec{c}_i \le 0$.
- A blue point $p_b$ violates our requirement, namely, $\vec{p}_b \cdot \vec{c}_i \ge 0$. The proof is similar, and omitted (a good exercise for you).

It follows that
$$\|\vec{c}_k\|^2 \;\le\; \|\vec{c}_{k-1}\|^2 + R^2 \;\le\; \|\vec{c}_{k-2}\|^2 + 2R^2 \;\le\; \cdots \;\le\; \|\vec{c}_0\|^2 + kR^2 \;=\; kR^2. \quad (2)$$
Proof (cont.). Now we combine (1) and (2) to obtain an upper bound on $k$. From (1), the fact that $\|\vec{u}\| = 1$, and the Cauchy–Schwarz inequality, we know that
$$\|\vec{c}_k\| = \|\vec{c}_k\| \cdot \|\vec{u}\| \;\ge\; \vec{c}_k \cdot \vec{u} \;\ge\; k\gamma.$$
Therefore, $\|\vec{c}_k\|^2 \ge k^2 \gamma^2$. Comparing this to (2) gives:
$$kR^2 \;\ge\; k^2 \gamma^2 \;\Rightarrow\; k \;\le\; \frac{R^2}{\gamma^2}.$$
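As a sanity check (this calculation is not in the original slides), consider the dataset of Example 4 and take $\vec{u} = (1/\sqrt{2}, -1/\sqrt{2})$, a unit-length separating vector. Every point $p \in P$ satisfies $|\vec{p} \cdot \vec{u}| = 1/\sqrt{2}$ and $\|\vec{p}\| = 1$, so
$$\gamma = \frac{1}{\sqrt{2}}, \qquad R = 1, \qquad k \;\le\; \frac{R^2}{\gamma^2} = 2,$$
and indeed the run in Example 4 adjusted $\vec{c}$ exactly twice.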
There is only one issue remaining. Recall that our original goal was to solve the problem in Definition 1. We instead turned to the problem in Definition 3, claiming that this suffices. Next, we will establish this claim. Let $P$ be a $d$-dimensional dataset which is the input to Definition 1. We create another dataset $P'$ of dimensionality $d+1$ for Definition 3 as follows. For each point $p = (x_1, \ldots, x_d)$ in $P$, create a point $p' = (x_1, \ldots, x_d, 1)$ in $P'$; namely, add one more dimension on which the coordinates of all points of $P'$ are fixed to 1. The color of $p'$ is the same as that of $p$.
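This lifting step is a one-liner with numpy; here is a minimal sketch (the function name is illustrative), together with how a separator for $P'$ maps back to a plane for $P$ per Lemma 6 below:

```python
import numpy as np

def lift(points):
    """Map each point (x1, ..., xd) of P to (x1, ..., xd, 1) in P'."""
    return np.hstack([points, np.ones((points.shape[0], 1))])

# If perceptron(lift(points), labels) returns (c1, ..., cd, c_{d+1}),
# then x1*c1 + ... + xd*cd + c_{d+1} = 0, i.e., the plane of
# Definition 1 with lambda = -c_{d+1}, separates the original P.
```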
Now we prove that the problem of Definition 1 can be converted to that of Definition 3.

Lemma 6. $P$ is linearly separable if and only if $P'$ is linearly separable.

Proof. Direction "If". Suppose that $x_1 c_1 + x_2 c_2 + \cdots + x_d c_d + x_{d+1} c_{d+1} = 0$ is a separation plane for $P'$ in Definition 3. This means that for every red point $p' = (x_1, x_2, \ldots, x_d, 1)$ in $P'$, it holds that $x_1 c_1 + x_2 c_2 + \cdots + x_d c_d + 1 \cdot c_{d+1} > 0$. Also, for every blue point $p' = (x_1, x_2, \ldots, x_d, 1)$ in $P'$, it holds that $x_1 c_1 + x_2 c_2 + \cdots + x_d c_d + 1 \cdot c_{d+1} < 0$. This means that $x_1 c_1 + x_2 c_2 + \cdots + x_d c_d + c_{d+1} = 0$ (i.e., the plane with $\lambda = -c_{d+1}$) is a separation plane for $P$ in Definition 1.
Proof (cont.). Direction "Only-If". Suppose that $x_1 c_1 + x_2 c_2 + \cdots + x_d c_d + c_{d+1} = 0$ (i.e., the plane with $\lambda = -c_{d+1}$) is a separation plane for $P$ in Definition 1; without loss of generality, assume the red points fall on its positive side (otherwise, negate all the coefficients). Then for every red point $p = (x_1, x_2, \ldots, x_d)$ in $P$, it holds that $x_1 c_1 + x_2 c_2 + \cdots + x_d c_d + c_{d+1} > 0$. Also, for every blue point $p = (x_1, x_2, \ldots, x_d)$ in $P$, it holds that $x_1 c_1 + x_2 c_2 + \cdots + x_d c_d + c_{d+1} < 0$. This means that $x_1 c_1 + x_2 c_2 + \cdots + x_d c_d + x_{d+1} c_{d+1} = 0$ is a separation plane for $P'$ in Definition 3.