3. Linear discrimination and single-layer neural networks

History: Sept 6, 2014: created. Oct 1, 2014: corrected typos, made Fig. 3.2 visible.

In this section we will treat a special case of two-class classification, namely, linear discrimination. Together with the maths we will introduce a particular conceptual / graphical notation, namely, we cast the classification algorithm as a neural network. Linear discrimination is the basis for more advanced techniques that we will treat later, but already by itself it can be applied in clever ways that make it quite powerful. I follow closely chapter 3 of Bishop's book.

3.1 Linear discrimination: two classes

Recall: toward the end of Section 2 we introduced discriminant functions as monotonically increasing functions f,

(3.1)   $y_i(e) = f(p(e \mid C_i)\, P(C_i))$,

of the class-conditional probability times the prior. For the case of binary discrimination we mentioned that one can introduce a single discriminant function

(3.2)   $y(e) = y_1(e) - y_2(e)$

and decide that e falls into class 1 whenever y(e) > 0. We remarked that it is sometimes easier to learn discriminant functions directly from the training data than to first estimate the class-conditional distributions and then construct the discriminant function from those distributions in a second step. This is the approach we will take in this section: we ignore the connection of discriminant functions with distributions and start directly from a given functional form of the two-class discriminant function (3.2), namely, linear discriminants of the form

(3.3)   $y(x) = w^T x + w_0$,

where w is a weight (column) vector and $w_0$ is a bias. The vector x here is either the raw input data example (if it comes in the form of a real-valued vector, as for instance the 240-dimensional pixel value vectors for our digit images), or some suitably transformed version of the raw data example (for instance, a one-dimensional feature value or a feature vector). For the case of two-dimensional inputs $x = (x_1, x_2)$, linear discriminants can be visualized as in Fig. 3.1. See the figure caption for a geometrical interpretation.

In a neural network interpretation, a linear discriminant corresponds to a network with M + 1 input neurons, where M is the dimension of the inputs x, and a single output neuron, where the output y(x) of the discriminant is read off. The first input neuron $x_0$ receives constant input 1, the remaining input neurons receive the input $x = (x_1, \ldots, x_M)$. The total input is $\tilde{x} = (1, x_1, \ldots, x_M) =: (x_0, x_1, \ldots, x_M)$. The network weights are $(w_0, w_1, \ldots, w_M) = (w_0, w^T) = \tilde{w}^T$. The output of this network is computed in the output neuron by summing up the inputs, weighted by the weights, which gives

(3.4)   $y(x) = \tilde{w}^T \tilde{x} = w_0 + w^T x$.

Fig. 3.1: Geometrical interpretation of a two-class linear discriminant $y(x) = w^T x + w_0$ for two-dimensional inputs x. A hyperplane H defined by y(x) = 0 separates the feature space into two decision regions $R_1$ and $R_2$. The hyperplane has orientation perpendicular to w and distance $w_0 / \|w\|$ to the origin. (Figure after the book by Bishop.)

The notation $y(x) = \tilde{w}^T \tilde{x}$ is often more convenient than (3.3). The network representation of the discriminant is shown in Fig. 3.2.

Figure 3.2: A network representation of a two-class linear discriminant.
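To make the bias-absorption trick concrete, here is a minimal sketch (Python with NumPy; the weight values are made up for illustration) of evaluating a two-class linear discriminant both in the form (3.3) and in the augmented form (3.4):

```python
import numpy as np

# Hypothetical weights for a two-class linear discriminant on 2D inputs
w = np.array([1.5, -0.5])   # weight vector w
w0 = 0.25                   # bias w_0

x = np.array([0.8, 1.2])    # an input pattern

# Form (3.3): y(x) = w^T x + w_0
y_plain = w @ x + w0

# Form (3.4): absorb the bias into augmented weight / input vectors
w_tilde = np.concatenate(([w0], w))     # (w_0, w_1, ..., w_M)
x_tilde = np.concatenate(([1.0], x))    # (1, x_1, ..., x_M)
y_aug = w_tilde @ x_tilde

print(y_plain, y_aug)                   # identical values
print("class 1" if y_aug > 0 else "class 2")
```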

3.2 Linear discrimination: several classes

The two-class case can be extended to K classes by introducing linear discriminant functions $y_k$ for each class k:

(3.5)   $y_k(x) = w_k^T x + w_{k0}$,

assigning an input pattern x to class k if $y_k(x) > y_j(x)$ for all $j \neq k$. Because $y_k(x) > y_j(x)$ iff $y_k(x) - y_j(x) > 0$, the decision boundary between classes k and j is given by

(3.6)   $y_k(x) - y_j(x) = (w_k - w_j)^T x + (w_{k0} - w_{j0}) = 0$.

The network representation of (3.6) is sketched in Fig. 3.3.

Figure 3.3: Representation of multiple linear discriminant functions (outputs $y_1(x), \ldots, y_K(x)$ computed from inputs $x_0 = 1, x_1, \ldots, x_M$).

As before in (3.4), we cover the bias by an additional constant input $x_0$ of unit size and thus may re-write (3.5) as

(3.7)   $y_k(x) = \tilde{w}_k^T \tilde{x}$,

where $\tilde{x} = (1, x_1, \ldots, x_M) =: (x_0, x_1, \ldots, x_M)$ and $(w_{k0}, w_{k1}, \ldots, w_{kM}) = \tilde{w}_k^T$. The decision regions are now regions in $\mathbb{R}^{M+1}$. They have linear hyperplanes as boundaries, as can be seen from (3.6). Furthermore, the decision regions are connected and convex. To see this, consider two points $x_A$ and $x_B$ which both lie in region $R_k$. Any point $\hat{x}$ that lies on a line between $x_A$ and $x_B$ can be written as $\hat{x} = \alpha x_A + (1 - \alpha) x_B$ for some $0 \leq \alpha \leq 1$. From the linearity of the discriminant functions it follows that $y_k(\hat{x}) > y_j(\hat{x})$ for all $j \neq k$. Therefore all $\hat{x}$ between $x_A$ and $x_B$ are in class k, too. This is schematically shown in Fig. 3.4.

Figure 3.4: Convexity and connectedness of linear decision regions.
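As a small illustration of the decision rule behind (3.5)-(3.7), the following sketch (Python/NumPy, with made-up weights) classifies a point by taking the argmax over the K linear discriminant outputs:

```python
import numpy as np

# Hypothetical augmented weight matrix for K = 3 classes and M = 2 inputs:
# row k holds (w_k0, w_k1, w_k2), i.e. bias plus weights of discriminant y_k.
W = np.array([[ 0.1,  1.0,  0.0],
              [-0.2,  0.0,  1.0],
              [ 0.0, -1.0, -1.0]])

def classify(x, W):
    """Assign x to the class k whose discriminant y_k(x) is largest."""
    x_tilde = np.concatenate(([1.0], x))   # augmented input (1, x_1, ..., x_M)
    y = W @ x_tilde                        # all K discriminant values at once
    return int(np.argmax(y)), y

k, y = classify(np.array([0.5, -0.3]), W)
print("winning class:", k, "discriminant values:", y)
```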

3.3 Learning the weights of a linear discriminant

Next we embark on the question of how to learn the weights $\tilde{w}_k$. A very popular (I dare say by far the most popular) method to learn these weights is to re-interpret the classification task as a regression task and then use linear regression to obtain the $\tilde{w}_k$. In more detail, this works as follows.

1. A perfectly working linear discriminant would yield $y_k(x) = \tilde{w}_k^T \tilde{x} = 1$ and $y_j(x) = \tilde{w}_j^T \tilde{x} = 0$ for $j \neq k$ if the input x belongs to class k. Regarded as training data for a regression, the training sample would be a collection

(3.8)   $(x_i, y_i)_{i = 1, \ldots, N}$,

where $y_i = (y_i^1, \ldots, y_i^K)^T$ is a K-dimensional indicator vector that is zero everywhere except at position k, when the input belongs to class k. Seen in this way, the linear discriminant is just a function $f \colon \mathbb{R}^M \to \mathbb{R}^K$ whose k-th component is $y_k(x) = \tilde{w}_k^T \tilde{x}$.

2. As we described in Section 2.6, in order to obtain "optimal" weights we need to fix a loss function which defines, in the first place, what "optimal" means. We opt for the most convenient loss function that exists, the square error:

(3.9)   $SE^{\mathrm{train}} = \sum_{i=1}^{N} \big( \tilde{w}_k^T \tilde{x}_i - y_i(k) \big)^2$,

where $y_i(k)$ is the k-th entry of the K-dimensional indicator vector $y_i$. The sought optimal weights are then given by

(3.10)   $\hat{w}_k = \operatorname{argmin}_{\tilde{w}_k} \sum_{i=1}^{N} \big( \tilde{w}_k^T \tilde{x}_i - y_i(k) \big)^2$.

3. In order to compute the solution of the minimization task (3.10), we first write out the inner product in (3.10) as a sum,

(3.11)   $\sum_{i=1}^{N} \big( \tilde{w}_k^T \tilde{x}_i - y_i(k) \big)^2 = \sum_{i=1}^{N} \Big( \sum_{j=0}^{M} w_{kj}\, x_i^j - y_i(k) \Big)^2$,

and then observe that when (3.11) attains its minimum in the (M+1)-dimensional space of the weights $w_{k0}, w_{k1}, \ldots, w_{kM}$, all the following partial derivatives must be zero:

(3.12)   $\frac{\partial}{\partial w_{k\alpha}} \sum_{i=1}^{N} \Big( \sum_{j=0}^{M} w_{kj}\, x_i^j - y_i(k) \Big)^2 = 0$,   where $\alpha = 0, \ldots, M$.

This gives M+1 linear equations

(3.13)   $0 = \frac{\partial}{\partial w_{k\alpha}} \sum_{i=1}^{N} \Big( \sum_{j=0}^{M} w_{kj}\, x_i^j - y_i(k) \Big)^2 = 2 \sum_{i=1}^{N} \Big( \sum_{j=0}^{M} w_{kj}\, x_i^j - y_i(k) \Big) x_i^{\alpha} = 2 \sum_{i=1}^{N} \big( (\tilde{w}_k^T \tilde{x}_i)\, x_i^{\alpha} - y_i(k)\, x_i^{\alpha} \big)$

for the M+1 unknowns $w_{k0}, w_{k1}, \ldots, w_{kM}$. Writing all training inputs $\tilde{x}_i^T$ as rows into an $N \times (M+1)$ sized data collection matrix X, and all training target outputs $y_i(k)$ into an $N \times 1$ sized data collection vector $y^k$, we can join the M+1 equations (3.13) into a single matrix equation and obtain $0^T = \tilde{w}_k^T X^T X - (y^k)^T X$, or equivalently

(3.14)   $\tilde{w}_k^T R = (y^k)^T X$,

where $R = X^T X$ is the $(M+1) \times (M+1)$ sized correlation matrix of the training inputs $\tilde{x}_i$. By a final assembly step, we join the K equations (3.14) into a single matrix equation

(3.15)   $W R = Y^T X$,

where W is the $K \times (M+1)$ sized network weight matrix which contains $\tilde{w}_k^T$ in its k-th row, and Y is the $N \times K$ sized data collection matrix made from joining all target vectors $y_i^T$ in its rows. Solving (3.15) in a naively straightforward attempt would lead to

(3.16)   $W = Y^T X\, R^{-1}$,

which however would often not work in practice because the correlation matrix R may be singular (then the inverse does not exist), or it might be close to singular ("ill-conditioned"; then computing the inverse is numerically unstable). Especially the latter situation is the rule, not the exception, for most real-life training data X. One remedies this situation by adding a scaled version of the (M+1)-dimensional identity matrix I to R before inverting it:

(3.17)   $W = Y^T X\, (R + c\,I)^{-1}$,

which renders the calculation well-defined and numerically stable.
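As a numerical illustration of (3.15)-(3.17), the following sketch (Python/NumPy, with a small made-up training set) builds the data collection matrices X and Y and computes the regularized weight matrix; the class labels and the value of c are arbitrary choices for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up training set: N = 6 two-dimensional inputs with class labels in {0, 1, 2}
inputs = rng.normal(size=(6, 2))        # raw inputs x_i, shape N x M
labels = np.array([0, 0, 1, 1, 2, 2])   # class of each x_i
N, M = inputs.shape
K = 3

# Data collection matrix X: row i is the augmented input (1, x_i1, ..., x_iM)
X = np.hstack([np.ones((N, 1)), inputs])          # shape N x (M+1)

# Data collection matrix Y: row i is the K-dimensional indicator vector y_i
Y = np.zeros((N, K))
Y[np.arange(N), labels] = 1.0

# Regularized solution (3.17): W = Y^T X (R + c I)^{-1}, with R = X^T X
c = 0.1
R = X.T @ X
W = Y.T @ X @ np.linalg.inv(R + c * np.eye(M + 1))   # shape K x (M+1)

# Classify a new point with the learned discriminants
x_new = np.array([1.0, 0.2, -0.5])        # already augmented: (1, x_1, x_2)
print("predicted class:", np.argmax(W @ x_new))
```

In practice one would solve the linear system (for instance via np.linalg.solve) rather than forming the inverse explicitly, but the direct formula mirrors (3.17).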

There is a second, illuminating interpretation of the helper term c I. In Section 2.8 I mentioned that a common way of regularizing a loss minimization problem is to add a weighted penalty term to the loss function, that is, to estimate model parameters by $\hat{\theta} = \operatorname{argmin}_{\theta}\, L(\mathcal{D}, \theta) + c\, \Omega(\theta)$. I furthermore said that one of the most popular regularizers is $\Omega(\theta) = \sum_{\alpha} \theta_{\alpha}^2$, that is, the sum of squared model parameters. In our regression case, the model parameters are the network weights W. Thus the regularized version of our optimization problem (3.10) would read

(3.18)   $\hat{w}_k = \operatorname{argmin}_{\tilde{w}_k}\, \sum_{i=1}^{N} \big( \tilde{w}_k^T \tilde{x}_i - y_i(k) \big)^2 + c\, \|\tilde{w}_k\|^2$.

It can be shown (homework exercise) that (3.17) is the solution to the regularized optimization problem (3.18). Because adding a multiple c I to R looks geometrically like putting a diagonal "ridge" on R, (3.17) is also widely known as ridge regression. The more educated terminology is to call it Tychonov regularization, after the mathematician who explored this method (in more depth than was indicated here).

I conclude this part with a general, abstract formulation of linear regression tasks. To state it conveniently, I use the notion of the Frobenius norm of a matrix M:

(3.19)   $\|M\|_{\mathrm{FRO}} := \Big( \sum_{i,j} m_{ij}^2 \Big)^{1/2}$.

Hence, $\|M\|_{\mathrm{FRO}}^2$ is the sum of all squared entries of M. The linear regression task in its general format is to compute

(3.20)   $\hat{W} = \operatorname{argmin}_{W}\, \|W X^T - Y^T\|_{\mathrm{FRO}}^2$,

where X is a data collection matrix containing data input vectors in its rows and Y contains target output vectors $y_i$ in its rows. This problem is usually solved in its Tychonov-regularized version, which gives us the following take-home message:

The Tychonov-regularized solution to a linear regression problem
$\hat{W} = \operatorname{argmin}_{W}\, \big( \|W X^T - Y^T\|_{\mathrm{FRO}}^2 + c\, \|W\|_{\mathrm{FRO}}^2 \big)$
is $\hat{W} = Y^T X\, (R + c\,I)^{-1}$.
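The take-home formula can be checked empirically. The sketch below (Python/NumPy, on made-up regression data) compares the regularized Frobenius loss at the closed-form solution against randomly perturbed weight matrices; the closed-form solution should never lose:

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up regression data: N = 20 augmented inputs (M+1 = 4), K = 3 outputs
X = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 3))])
Y = rng.normal(size=(20, 3))
c = 0.5

def reg_loss(W):
    """Regularized squared Frobenius loss ||W X^T - Y^T||^2 + c ||W||^2."""
    return np.sum((W @ X.T - Y.T) ** 2) + c * np.sum(W ** 2)

# Closed-form Tychonov solution: W = Y^T X (X^T X + c I)^{-1}
R = X.T @ X
W_hat = Y.T @ X @ np.linalg.inv(R + c * np.eye(R.shape[0]))

# Random perturbations of W_hat never decrease the loss
base = reg_loss(W_hat)
worse = min(reg_loss(W_hat + 0.01 * rng.normal(size=W_hat.shape)) for _ in range(100))
print(base <= worse)   # True: W_hat minimizes the regularized loss
```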

The practical usefulness and wide applicability range of Tychonov-regularized linear regression can hardly be over-estimated.

3.4 Generalized linear discriminants

At the end of Section 3.2 we have seen that the decision boundaries obtained through linear discriminants are linear hyperplanes, and the resulting decision regions are convex and connected. This may appear to severely limit the usefulness of linear discriminants, because it seems obvious that in many if not almost all real-life problems the class boundaries should have a "nonlinear" geometry. Consider for example the case of two classes $C_1$ and $C_2$, where inputs x are two-dimensional vectors and x falls into class $C_1$ iff x lies within the unit circle.

Figure 3.5: Two linearly inseparable classes $C_1$ (red dots) and $C_2$ (blue crosses).

There is no way to separate $C_1$ from $C_2$ by a linear discriminant $y(x) = \tilde{w}^T \tilde{x}$ which would classify x as belonging to class $C_1$ whenever y(x) < 0. However, if instead of the original 2-dimensional input vectors $x = (x_1, x_2)$ we use the one-dimensional feature $\phi(x) = x^T x$ (the squared norm of x) as input, the linear discriminant $y(\phi(x)) = (-1, 1)\,(1, \phi(x))^T = x^T x - 1$ would clearly do the job.

This motivates the introduction of generalized linear discriminants. A generalized linear discriminant is a two-stage process where the raw M-dimensional inputs x are first passed through a bank of L feature "filters" $\phi_1, \ldots, \phi_L$, transforming x to an L-dimensional feature vector $(\phi_1(x), \ldots, \phi_L(x))^T$, which is then used instead of the raw input x to feed a linear discriminant. The general form of such networks is the following variant of (3.7):

(3.21)   $y_k(x) = \tilde{w}_k^T\, (1, \phi_1(x), \ldots, \phi_L(x))^T$,

where $k = 1, \ldots, K$ is again the index of the classes to be discriminated (K = number of output neurons).
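To see the circle example from above in action, here is a minimal sketch (Python/NumPy; the sampled data and the weight vector (-1, 1) are illustrative choices) of classifying points with the generalized linear discriminant built on the single feature $\phi(x) = x^T x$:

```python
import numpy as np

rng = np.random.default_rng(2)

# Sample some 2D points; class C1 = inside the unit circle, C2 = outside
x = rng.uniform(-2.0, 2.0, size=(200, 2))
true_c1 = np.sum(x ** 2, axis=1) < 1.0

# Generalized linear discriminant on the single feature phi(x) = x^T x:
# y(phi(x)) = (-1, 1) . (1, phi(x))^T = phi(x) - 1, class C1 whenever y < 0
phi = np.sum(x ** 2, axis=1)                 # one feature value per point
w_tilde = np.array([-1.0, 1.0])              # (bias weight, feature weight)
y = w_tilde[0] * 1.0 + w_tilde[1] * phi      # discriminant values
pred_c1 = y < 0.0

print("agreement with true labels:", np.mean(pred_c1 == true_c1))   # 1.0
```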

A popular type of such networks is radial basis function networks (RBF networks). If dim(x) = d, each filter $\phi_j$ is a symmetric ("radial"), typically unimodal function on $\mathbb{R}^d$ with center $\mu_j$. Gaussian density functions are a typical choice. For Gaussians, the output $\phi_j(x)$ of filter $\phi_j$ (j > 0) is

(3.22)   $\phi_j(x) = \exp\Big( -\frac{\|x - \mu_j\|^2}{2 \sigma_j^2} \Big)$.

Note that (i) we do not normalize the Gaussian function here to integral 1, and (ii) although we have a d-dimensional Gaussian, we do not have to care about a covariance matrix $\Sigma$ because we restrict ourselves to radially symmetric Gaussians. The filter $\phi_j(x)$ returns large values (close to 1) if the input vector x lies close (in metric distance) to the Gaussian's center $\mu_j$, and decreases (with a rate given by $\sigma_j$) when the distance grows larger.

RBF networks offer the possibility to place many fine-grained filters $\phi_j$ into regions of the input space X where we need a fine-tuned discrimination, and to be more generous in "less interesting" regions where we plant only a few broad filters. Figure 3.6 shows an example where the raw data points x are one-dimensional and where we want a high discrimination precision around the origin and around 1.

Figure 3.6: Radial basis functions example, with filters $\phi_1, \ldots, \phi_L$ over a one-dimensional input space X.

Two background notes:

Remark 1: The performance of RBF networks obviously depends on the proper sizing and placement of the basis functions $\phi_j$. These are often optimized by unsupervised training schemes in a data-driven way. In Section 4 we will introduce such an algorithm that is often used with RBF networks.

Remark 2: Any desired input-output mapping f from the original input data space X to the output units $y_k$ can be achieved with perfect precision with networks of the kind specified by (3.21). This is trivially clear because you may just use L = K and $\phi_k = f_k$ and $w_{kj} = \delta_{kj}$; all the work is done by the filters $\phi_k$. However, more interesting results state that any desired input-output mapping f can be approximated arbitrarily well with radial basis functions of a given simple class, for instance Gaussians. The art of designing RBF networks is to achieve good performance with as few basis filters as possible, because the fewer filters you have, the fewer training data points you need for estimating the network weights (another instance of the bias-variance dilemma!).
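A minimal sketch of an RBF feature map in the spirit of (3.22) (Python/NumPy; the centers, widths, and data are made-up choices):

```python
import numpy as np

def rbf_features(x, centers, sigmas):
    """Map each d-dimensional input to L Gaussian RBF responses phi_1..phi_L."""
    # x: (n, d) inputs; centers: (L, d); sigmas: (L,) widths of the Gaussians
    sq_dists = np.sum((x[:, None, :] - centers[None, :, :]) ** 2, axis=2)  # (n, L)
    return np.exp(-sq_dists / (2.0 * sigmas[None, :] ** 2))

# Made-up example: 1D inputs, three narrow filters near 0 and one broad filter near 1
centers = np.array([[-0.2], [0.0], [0.2], [1.0]])
sigmas = np.array([0.1, 0.1, 0.1, 0.5])

x = np.array([[0.05], [0.9], [2.0]])
Phi = rbf_features(x, centers, sigmas)       # (3, 4) feature matrix
print(np.round(Phi, 3))
```

These feature vectors could then replace the raw inputs when building the data collection matrix X for the regression-based weight estimation of Section 3.3.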

A cautionary remark. The least mean square solution for learning network weights from data is easy to compute and does not require much thinking about the class-conditional distributions of the input features. That's good. However, it is not trivial to find good preprocessing filters $\phi_j$ if you want to tackle nonlinear classification problems (you will need unsupervised learning techniques to optimize them). Nor, even if you have found good $\phi_j$, is the least mean square approach necessarily the best you can do for training classifiers (because it tends to over-represent extreme or even outlier inputs; you may land far from the optimal weights that would be yielded by a probabilistic approach where you first estimate the posterior class distributions). So there is ample room for improvement. This all said, in practice a linear discriminant trained by minimizing square error often is a quite accurate and certainly a simple way to learn a classifier.

3.5 Perceptrons

Historically, the first "neural networks" (not called like that then) for classification tasks were introduced by the psychologist and neurobiologist Frank Rosenblatt in the late 1950s and early 1960s and named Perceptrons. Today we would call them linear discrimination networks. Perceptrons were biologically inspired in a context of visual pattern classification from pixel images. Another characteristic of perceptrons is that they come with a particular type of feature extraction, that is, their input neurons correspond to a particular kind of (linear) features extracted from pixel images, and the values of the output neurons (which we called y) were passed through a binary threshold function f to yield binary classification outputs. Figure 3.7 (redrawn from Bishop's book) shows the setup of a perceptron. There exists a learning rule for perceptrons that incrementally adapts network weights for maximal discrimination rates; this rule can be proven to converge (a sketch follows after the figure caption below).

Figure 3.7: The perceptron's input neurons $\phi_j$ are patched to the input pattern by random linear links (which makes the $\phi_j$ linear and the total behavior of the perceptron linear too). They typically compute their outputs by a threshold function from the sum of the signals received through these links. Input neuron outputs are weighted, summed, and passed through another threshold function f whose output indicates whether the pattern belongs to class 1 or class 2 (binary classification).
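The lecture text does not spell the learning rule out, so the following is a sketch of the classical Rosenblatt perceptron rule (Python/NumPy, on made-up feature vectors with targets +1 / -1): whenever a pattern is misclassified, the weight vector is nudged toward (or away from) that pattern.

```python
import numpy as np

rng = np.random.default_rng(3)

# Two made-up, linearly separable clusters with targets +1 / -1
pos = rng.normal(size=(25, 2)) * 0.5 + np.array([2.0, 2.0])
neg = rng.normal(size=(25, 2)) * 0.5 + np.array([-2.0, -2.0])
features = np.vstack([pos, neg])
targets = np.hstack([np.ones(25), -np.ones(25)])

# Augment with a constant 1 so the bias is part of the weight vector
Phi = np.hstack([np.ones((50, 1)), features])

w = np.zeros(Phi.shape[1])
eta = 1.0                                   # learning rate
for epoch in range(100):
    errors = 0
    for phi, t in zip(Phi, targets):
        if np.sign(w @ phi) != t:           # pattern misclassified
            w = w + eta * t * phi           # Rosenblatt update
            errors += 1
    if errors == 0:                         # all patterns classified correctly
        break

print("epochs used:", epoch + 1, "final weights:", np.round(w, 2))
```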

Rosenblatt first implemented his perceptrons on one of the early digital computers, the IBM Mark 1. Not satisfied with the performance, he proceeded to implement perceptrons in analog hardware, with the adaptive network weights realized by potentiometers driven by electrical motors. Figure 3.8 shows some impressions of this physical neural network.

Fig. 3.8: Rosenblatt's analog-electronic realization of the perceptron. (a): Pattern input: brightly lit B/W patterns recorded by an array of 20 x 20 photocells. (b): "Neuronal wiring". (c): Adaptive weights realized by motor-driven potentiometers. (All images from Bishop 2006 [1], Chapter 4.1.)

Perceptrons led to an early hype (several others were to follow) in how artificial intelligence was perceived by the general public. Here is a snippet from the New York Times ("New Navy Device Learns by Doing", NYT July 8, 1958), after a press conference held by Rosenblatt at the US Office of Naval Research on July 7, 1958 (cited after Olazaran 1996 [2]). [Newspaper clipping shown as an image in the original.]

Perceptron research suffered a sudden and dramatic death when Marvin Minsky and Seymour Papert, pioneers of (symbolic) AI, published a book, Perceptrons, in 1969 where they put their finger on what you have learnt earlier in this lecture: the perceptron, as it were, could only distinguish between pattern classes that are linearly separable. In particular, the XOR function, which maps binary input pairs to their XOR, cannot be learnt by a perceptron. This (obvious, by today's standards) insight shattered neural network research and sent it into a sleep from which it woke up only a little less than 20 years later, when multilayer perceptrons (MLPs), which can learn the XOR task, became trainable by the backpropagation algorithm. We will learn about MLPs later in this course. Today certain versions of MLPs are the most powerful ML methods that exist for a number of highly relevant applications: speech recognition, automated text translation, image classification, handwriting recognition. Except for the reproduction capability and consciousness predicted in the NYT article, the other capabilities have come close to being mastered by artificial neural networks.

[1] Bishop, C. M. (2006), Pattern Recognition and Machine Learning. Springer Verlag.
[2] Olazaran, M. (1996), A sociological study of the official history of the Perceptrons Controversy. Social Studies of Science 26(3).

Perceptrons with their original learning rule are still sometimes used, due to their simplicity and their ancient fame, though hardly by ML professionals.
