4 The Perceptron


4.1 Introduction

The perceptron is the simplest form of a neural network used for the classification of a special type of patterns said to be linearly separable (i.e., patterns that lie on opposite sides of a hyperplane). Basically, it consists of a single neuron with adjustable synaptic weights and threshold, as shown in Fig. 4.1. The algorithm used to adjust the free parameters of this neural network first appeared in a learning procedure developed by Rosenblatt (1958, 1962) for his perceptron brain model. Indeed, Rosenblatt proved that if the patterns (vectors) used to train the perceptron are drawn from two linearly separable classes, then the perceptron algorithm converges and positions the decision surface in the form of a hyperplane between the two classes. The proof of convergence of the algorithm is known as the perceptron convergence theorem.

The single-layer perceptron depicted in Fig. 4.1 has a single neuron. Such a perceptron is limited to performing pattern classification with only two classes. By expanding the output (computation) layer of the perceptron to include more than one neuron, we may correspondingly perform classification with more than two classes. However, the classes would have to be linearly separable for the perceptron to work properly. Accordingly, insofar as the basic theory of the single-layer perceptron as a pattern classifier is concerned, we need consider only the single-neuron configuration shown in Fig. 4.1. The extension of the theory so developed to the case of more than one neuron is a trivial matter.

Organization of the Chapter

This chapter on the single-layer perceptron is relatively short, but it is important for historical and theoretical reasons. In Section 4.2 we present some basic considerations involved in the operation of the perceptron. Then, in Section 4.3 we describe Rosenblatt's original algorithm for adjusting the synaptic weight vector of the perceptron for pattern classification of linearly separable classes, and demonstrate convergence of the algorithm. This is followed by the description of a performance measure for the single-layer perceptron in Section 4.4. In Section 4.5 we consider the relationship between the single-layer perceptron and the maximum-likelihood Gaussian classifier. The chapter concludes with a general discussion in Section 4.6.

The network organization of the original version of the perceptron, as envisioned by Rosenblatt (1962), has three types of units: sensory units, association units, and response units. The connections from the sensory units to the association units have fixed weights, and the connections from the association units to the response units have variable weights. The association units act as preprocessors designed to extract a pattern from the environmental input. Insofar as the variable weights are concerned, the operation of Rosenblatt's original perceptron is essentially the same as that depicted in Fig. 4.1 for the case of a single response unit.

FIGURE 4.1 Single-layer perceptron.

4.2 Basic Considerations

Basic to the operation of Rosenblatt's perceptron is the McCulloch-Pitts model of a neuron. From Chapter 1, we recall that such a neuron model consists of a linear combiner followed by a hard limiter, as depicted in Fig. 4.2. The summing node of the neuron model computes a linear combination of the inputs applied to its synapses, and also accounts for an externally applied threshold. The resulting sum is applied to a hard limiter. Accordingly, the neuron produces an output equal to +1 if the hard-limiter input is positive, and -1 if it is negative.

In the signal-flow graph model of Fig. 4.2, the synaptic weights of the single-layer perceptron are denoted by $w_1, w_2, \ldots, w_p$. (We have simplified the notation here by not including an additional subscript to identify the neuron, since we only have to deal with a single neuron.) Correspondingly, the inputs applied to the perceptron are denoted by $x_1, x_2, \ldots, x_p$. The externally applied threshold is denoted by $\theta$. From the model of Fig. 4.2 we thus find that the linear combiner output (i.e., hard-limiter input) is

$$v = \sum_{i=1}^{p} w_i x_i - \theta \qquad (4.1)$$

The purpose of the perceptron is to classify the set of externally applied stimuli $x_1, x_2, \ldots, x_p$ into one of two classes, $\mathcal{C}_1$ or $\mathcal{C}_2$, say. The decision rule for the classification is to assign the point represented by the inputs $x_1, x_2, \ldots, x_p$ to class $\mathcal{C}_1$ if the perceptron output $y$ is +1 and to class $\mathcal{C}_2$ if it is -1.

To develop insight into the behavior of a pattern classifier, it is customary to plot a map of the decision regions in the $p$-dimensional signal space spanned by the $p$ input variables $x_1, x_2, \ldots, x_p$. In the case of an elementary perceptron, there are two decision regions separated by a hyperplane defined by

$$\sum_{i=1}^{p} w_i x_i - \theta = 0 \qquad (4.2)$$

FIGURE 4.2 Signal-flow graph of the perceptron.
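As a concrete illustration of Eqs. (4.1) and (4.2), the following sketch (Python with NumPy; not part of the original text, and the weight and input values are arbitrary choices for illustration) evaluates the linear combiner output and the hard-limited decision:

```python
import numpy as np

def perceptron_output(x, w, theta):
    """Hard-limited output of the single-neuron perceptron.

    v = sum_i w_i * x_i - theta   (Eq. 4.1); the decision boundary is v = 0 (Eq. 4.2).
    Returns +1 (class C1) if v >= 0 and -1 (class C2) otherwise.
    """
    v = np.dot(w, x) - theta
    return 1 if v >= 0 else -1

# Illustrative values for the two-input case of Fig. 4.3.
w = np.array([2.0, 1.0])      # synaptic weights w1, w2
theta = 1.0                   # threshold
print(perceptron_output(np.array([1.0, 0.5]), w, theta))    # positive side of the boundary -> +1
print(perceptron_output(np.array([-1.0, -0.5]), w, theta))  # negative side of the boundary -> -1
```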

The decision boundary of Eq. (4.2) is illustrated in Fig. 4.3 for the case of two input variables $x_1$ and $x_2$, in which case it takes the form of a straight line. A point $(x_1, x_2)$ that lies above the boundary line is assigned to class $\mathcal{C}_1$, and a point $(x_1, x_2)$ that lies below the boundary line is assigned to class $\mathcal{C}_2$. Note also that the effect of the threshold $\theta$ is merely to shift the decision boundary away from the origin.

FIGURE 4.3 Illustrating linear separability for a two-dimensional, two-class pattern-classification problem; the decision boundary is the line $w_1 x_1 + w_2 x_2 - \theta = 0$.

The synaptic weights $w_1, w_2, \ldots, w_p$ of the perceptron can be fixed or adapted on an iteration-by-iteration basis. For the adaptation, we may use an error-correction rule known as the perceptron convergence algorithm, developed in the next section.

4.3 The Perceptron Convergence Theorem

For the development of the error-correction learning algorithm for a single-layer perceptron, we find it more convenient to work with the modified signal-flow graph model of Fig. 4.4. In this second model, which is equivalent to that of Fig. 4.2, the threshold $\theta(n)$ is treated as a synaptic weight connected to a fixed input equal to -1. We may thus define the $(p+1)$-by-1 input vector

$$\mathbf{x}(n) = [-1, x_1(n), x_2(n), \ldots, x_p(n)]^T \qquad (4.3)$$

Correspondingly, we define the $(p+1)$-by-1 weight vector

$$\mathbf{w}(n) = [\theta(n), w_1(n), w_2(n), \ldots, w_p(n)]^T \qquad (4.4)$$

Accordingly, the linear combiner output is written in the compact form

$$v(n) = \mathbf{w}^T(n)\mathbf{x}(n) \qquad (4.5)$$

FIGURE 4.4 Equivalent signal-flow graph of the perceptron: linear combiner followed by hard limiter, with the fixed input $x_0 = -1$.

For fixed $n$, the equation $\mathbf{w}^T\mathbf{x} = 0$, plotted in a $p$-dimensional space with coordinates $x_1, x_2, \ldots, x_p$, defines a hyperplane as the decision surface between two different classes of inputs.
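In code, the reformulation of Eqs. (4.3)-(4.5) amounts to prepending a fixed -1 to each input vector and folding the threshold into the weight vector. A minimal Python sketch (with arbitrary illustrative values):

```python
import numpy as np

x = np.array([0.7, -1.2, 3.0])         # original p-dimensional input (illustrative)
w = np.array([0.1, 0.4, -0.2])         # original synaptic weights
theta = 0.5                            # threshold

x_aug = np.concatenate(([-1.0], x))    # Eq. (4.3): x(n) = [-1, x1(n), ..., xp(n)]^T
w_aug = np.concatenate(([theta], w))   # Eq. (4.4): w(n) = [theta(n), w1(n), ..., wp(n)]^T

# The augmented inner product of Eq. (4.5) equals the original combiner output of Eq. (4.1).
assert np.isclose(w_aug @ x_aug, w @ x - theta)
```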

Suppose then that the input variables of the single-layer perceptron originate from two linearly separable classes that fall on opposite sides of some hyperplane. Let $X_1$ be the subset of training vectors $\mathbf{x}_1(1), \mathbf{x}_1(2), \ldots$ that belong to class $\mathcal{C}_1$, and let $X_2$ be the subset of training vectors $\mathbf{x}_2(1), \mathbf{x}_2(2), \ldots$ that belong to class $\mathcal{C}_2$. The union of $X_1$ and $X_2$ is the complete training set $X$.

Given the sets of vectors $X_1$ and $X_2$ to train the classifier, the training process involves the adjustment of the weight vector $\mathbf{w}$ in such a way that the two classes $\mathcal{C}_1$ and $\mathcal{C}_2$ are separable. These two classes are said to be linearly separable if a realizable setting of the weight vector $\mathbf{w}$ exists. Conversely, if the two classes $\mathcal{C}_1$ and $\mathcal{C}_2$ are known to be linearly separable, then there exists a weight vector $\mathbf{w}$ such that we may state

$$\mathbf{w}^T\mathbf{x} \ge 0 \quad \text{for every input vector } \mathbf{x} \text{ belonging to class } \mathcal{C}_1$$
$$\mathbf{w}^T\mathbf{x} < 0 \quad \text{for every input vector } \mathbf{x} \text{ belonging to class } \mathcal{C}_2 \qquad (4.6)$$

Given the subsets of training vectors $X_1$ and $X_2$, the training problem for the elementary perceptron is then to find a weight vector $\mathbf{w}$ such that the two inequalities of Eq. (4.6) are satisfied.

The algorithm for adapting the weight vector of the elementary perceptron may now be formulated as follows:

1. If the $n$th member of the training set, $\mathbf{x}(n)$, is correctly classified by the weight vector $\mathbf{w}(n)$ computed at the $n$th iteration of the algorithm, no correction is made to the weight vector of the perceptron, as shown by

$$\mathbf{w}(n+1) = \mathbf{w}(n) \quad \text{if } \mathbf{w}^T(n)\mathbf{x}(n) \ge 0 \text{ and } \mathbf{x}(n) \text{ belongs to class } \mathcal{C}_1$$
$$\mathbf{w}(n+1) = \mathbf{w}(n) \quad \text{if } \mathbf{w}^T(n)\mathbf{x}(n) < 0 \text{ and } \mathbf{x}(n) \text{ belongs to class } \mathcal{C}_2 \qquad (4.7)$$

2. Otherwise, the weight vector of the perceptron is updated in accordance with the rule

$$\mathbf{w}(n+1) = \mathbf{w}(n) - \eta(n)\mathbf{x}(n) \quad \text{if } \mathbf{w}^T(n)\mathbf{x}(n) \ge 0 \text{ and } \mathbf{x}(n) \text{ belongs to class } \mathcal{C}_2$$
$$\mathbf{w}(n+1) = \mathbf{w}(n) + \eta(n)\mathbf{x}(n) \quad \text{if } \mathbf{w}^T(n)\mathbf{x}(n) < 0 \text{ and } \mathbf{x}(n) \text{ belongs to class } \mathcal{C}_1 \qquad (4.8)$$

where the learning-rate parameter $\eta(n)$ controls the adjustment applied to the weight vector at iteration $n$. If $\eta(n) = \eta > 0$, where $\eta$ is a constant independent of the iteration number $n$, we have a fixed-increment adaptation rule for the perceptron.

In the sequel, we first prove the convergence of the fixed-increment adaptation rule for which $\eta = 1$. Clearly, the value of $\eta$ is unimportant, so long as it is positive.
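The two-case update of Eqs. (4.7) and (4.8) can be written compactly on the augmented vectors of Eqs. (4.3) and (4.4). The sketch below (an illustrative Python rendering, not from the text) performs one iteration of the rule with a constant learning-rate parameter:

```python
import numpy as np

def fixed_increment_step(w, x, target_class, eta=1.0):
    """One iteration of the perceptron adaptation rule, Eqs. (4.7)-(4.8).

    w, x are the augmented (p+1)-by-1 vectors of Eqs. (4.3)-(4.4);
    target_class is 1 if x belongs to class C1 and 2 if it belongs to class C2.
    """
    v = w @ x                          # linear combiner output, Eq. (4.5)
    if target_class == 1 and v < 0:    # misclassified member of C1
        return w + eta * x             # second line of Eq. (4.8)
    if target_class == 2 and v >= 0:   # misclassified member of C2
        return w - eta * x             # first line of Eq. (4.8)
    return w                           # correctly classified: Eq. (4.7), no correction
```

With eta = 1 this is the fixed-increment rule whose convergence is proved next.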

A value of $\eta \ne 1$ merely scales the pattern vectors without affecting their separability. The case of a variable $\eta(n)$ is considered later in the section.

The proof is presented for the initial condition $\mathbf{w}(0) = \mathbf{0}$. Suppose that $\mathbf{w}^T(n)\mathbf{x}(n) < 0$ for $n = 1, 2, \ldots$ and an input vector $\mathbf{x}(n)$ that belongs to the subset $X_1$. That is, the perceptron incorrectly classifies the vectors $\mathbf{x}(1), \mathbf{x}(2), \ldots$, since the second condition of Eq. (4.6) is violated. Then, with the constant $\eta(n) = 1$, we may use the second line of Eq. (4.8) to write

$$\mathbf{w}(n+1) = \mathbf{w}(n) + \mathbf{x}(n) \quad \text{for } \mathbf{x}(n) \text{ belonging to class } \mathcal{C}_1 \qquad (4.9)$$

Given the initial condition $\mathbf{w}(0) = \mathbf{0}$, we may iteratively solve this equation for $\mathbf{w}(n+1)$, obtaining the result

$$\mathbf{w}(n+1) = \mathbf{x}(1) + \mathbf{x}(2) + \cdots + \mathbf{x}(n) \qquad (4.10)$$

Since the classes $\mathcal{C}_1$ and $\mathcal{C}_2$ are assumed to be linearly separable, there exists a solution $\mathbf{w}_0$ for which $\mathbf{w}_0^T\mathbf{x}(n) > 0$ for the vectors $\mathbf{x}(1), \ldots, \mathbf{x}(n)$ belonging to the subset $X_1$. For a fixed solution $\mathbf{w}_0$, we may then define a positive number $\alpha$ by the relation

$$\alpha = \min_{\mathbf{x}(n) \in X_1} \mathbf{w}_0^T\mathbf{x}(n) \qquad (4.11)$$

where $\mathbf{x}(n) \in X_1$ stands for "$\mathbf{x}(n)$ belongs to subset $X_1$." Hence, multiplying both sides of Eq. (4.10) by the row vector $\mathbf{w}_0^T$, we get

$$\mathbf{w}_0^T\mathbf{w}(n+1) = \mathbf{w}_0^T\mathbf{x}(1) + \mathbf{w}_0^T\mathbf{x}(2) + \cdots + \mathbf{w}_0^T\mathbf{x}(n) \qquad (4.12)$$

Accordingly, in light of the definition given in Eq. (4.11), we have

$$\mathbf{w}_0^T\mathbf{w}(n+1) \ge n\alpha \qquad (4.13)$$

Next, we make use of an inequality known as the Cauchy-Schwarz inequality. Given two vectors $\mathbf{w}_0$ and $\mathbf{w}(n+1)$, the Cauchy-Schwarz inequality states that

$$\|\mathbf{w}_0\|^2 \|\mathbf{w}(n+1)\|^2 \ge [\mathbf{w}_0^T\mathbf{w}(n+1)]^2 \qquad (4.14)$$

where $\|\cdot\|$ denotes the Euclidean norm of the enclosed argument vector, and the inner product $\mathbf{w}_0^T\mathbf{w}(n+1)$ is a scalar quantity. We now note from Eq. (4.13) that $[\mathbf{w}_0^T\mathbf{w}(n+1)]^2$ is equal to or greater than $n^2\alpha^2$. Moreover, from Eq. (4.14) we note that $\|\mathbf{w}_0\|^2\|\mathbf{w}(n+1)\|^2$ is equal to or greater than $[\mathbf{w}_0^T\mathbf{w}(n+1)]^2$. It follows therefore that

$$\|\mathbf{w}_0\|^2\|\mathbf{w}(n+1)\|^2 \ge n^2\alpha^2$$

or, equivalently,

$$\|\mathbf{w}(n+1)\|^2 \ge \frac{n^2\alpha^2}{\|\mathbf{w}_0\|^2} \qquad (4.15)$$

Next, we follow another development route. In particular, we rewrite Eq. (4.9) in the form

$$\mathbf{w}(k+1) = \mathbf{w}(k) + \mathbf{x}(k) \quad \text{for } k = 1, \ldots, n \text{ and } \mathbf{x}(k) \in X_1 \qquad (4.16)$$

Hence, taking the squared Euclidean norm of both sides of Eq. (4.16), we get

$$\|\mathbf{w}(k+1)\|^2 = \|\mathbf{w}(k)\|^2 + \|\mathbf{x}(k)\|^2 + 2\mathbf{w}^T(k)\mathbf{x}(k) \qquad (4.17)$$

But, under the assumption that the perceptron incorrectly classifies an input vector $\mathbf{x}(k)$ belonging to the subset $X_1$, we have $\mathbf{w}^T(k)\mathbf{x}(k) < 0$. We therefore deduce from Eq. (4.17) that

$$\|\mathbf{w}(k+1)\|^2 \le \|\mathbf{w}(k)\|^2 + \|\mathbf{x}(k)\|^2$$

or, equivalently,

$$\|\mathbf{w}(k+1)\|^2 - \|\mathbf{w}(k)\|^2 \le \|\mathbf{x}(k)\|^2, \quad k = 1, \ldots, n \qquad (4.18)$$

Adding these inequalities for $k = 1, \ldots, n$, and assuming the initial condition $\mathbf{w}(0) = \mathbf{0}$, we get the following condition:

$$\|\mathbf{w}(n+1)\|^2 \le \sum_{k=1}^{n} \|\mathbf{x}(k)\|^2 \le n\beta \qquad (4.19)$$

where $\beta$ is a positive number defined by

$$\beta = \max_{\mathbf{x}(k) \in X_1} \|\mathbf{x}(k)\|^2 \qquad (4.20)$$

Equation (4.19) states that the squared Euclidean norm of the weight vector $\mathbf{w}(n+1)$ grows at most linearly with the number of iterations $n$.

Clearly, the second result of Eq. (4.19) is in conflict with the earlier result of Eq. (4.15) for sufficiently large values of $n$. Indeed, we can state that $n$ cannot be larger than some value $n_{\max}$ for which Eqs. (4.15) and (4.19) are both satisfied with the equality sign. That is, $n_{\max}$ is the solution of the equation

$$\frac{n_{\max}^2 \alpha^2}{\|\mathbf{w}_0\|^2} = n_{\max}\beta$$

Solving for $n_{\max}$ for a solution vector $\mathbf{w}_0$, we find that

$$n_{\max} = \frac{\beta\|\mathbf{w}_0\|^2}{\alpha^2} \qquad (4.21)$$

We have thus proved that for $\eta(n) = 1$ for all $n$, and $\mathbf{w}(0) = \mathbf{0}$, and given that a solution vector $\mathbf{w}_0$ exists, the rule for adapting the synaptic weights connecting the association units to the response unit of the perceptron must terminate after at most $n_{\max}$ iterations. Note also from Eqs. (4.11), (4.20), and (4.21) that there is no unique solution for $\mathbf{w}_0$ or $n_{\max}$.

We may now state the fixed-increment convergence theorem for a single-layer perceptron as follows (Rosenblatt, 1962):

Let the subsets of training vectors $X_1$ and $X_2$ be linearly separable. Let the inputs presented to the single-layer perceptron originate from these two subsets. The perceptron converges after some $n_0$ iterations, in the sense that $\mathbf{w}(n_0) = \mathbf{w}(n_0+1) = \mathbf{w}(n_0+2) = \cdots$ is a solution vector for $n_0 \le n_{\max}$.

Consider next the absolute error-correction procedure for the adaptation of a single-layer perceptron, for which $\eta(n)$ is variable. In particular, let $\eta(n)$ be the smallest integer for which

$$\eta(n)\,\mathbf{x}^T(n)\mathbf{x}(n) > |\mathbf{w}^T(n)\mathbf{x}(n)|$$

With this procedure we find that if the inner product $\mathbf{w}^T(n)\mathbf{x}(n)$ at iteration $n$ has an incorrect sign, then $\mathbf{w}^T(n+1)\mathbf{x}(n)$ at iteration $n+1$ would have the correct sign. This suggests that if $\mathbf{w}^T(n)\mathbf{x}(n)$ has an incorrect sign, we may modify the training sequence at iteration $n+1$ by setting $\mathbf{x}(n+1) = \mathbf{x}(n)$. In other words, each pattern is presented repeatedly to the perceptron until that pattern is classified correctly.
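Returning to the fixed-increment case, the bound of Eq. (4.21) is easy to evaluate numerically. The following sketch (Python; the data and the solution vector are illustrative choices, with all patterns referred to class C1 as in the proof) computes alpha, beta, and n_max:

```python
import numpy as np

# Toy linearly separable patterns, all referred to class C1 as in the proof
# (class-C2 vectors would enter with their signs reversed). Illustrative only.
X1 = np.array([[2.0, 1.0],
               [1.5, 2.0],
               [3.0, 0.5]])
w0 = np.array([1.0, 1.0])                # a known solution vector: w0^T x > 0 for all rows

alpha = np.min(X1 @ w0)                  # Eq. (4.11): alpha = min over X1 of w0^T x(n)
beta = np.max(np.sum(X1**2, axis=1))     # Eq. (4.20): beta = max over X1 of ||x(k)||^2
n_max = beta * (w0 @ w0) / alpha**2      # Eq. (4.21): upper bound on the number of corrections
print(alpha, beta, n_max)                # here alpha = 3.0, beta = 9.25, n_max about 2.06
```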

Note also that the use of an initial value $\mathbf{w}(0)$ different from the null condition merely results in a decrease or increase in the number of iterations required to converge, depending on how $\mathbf{w}(0)$ relates to the solution $\mathbf{w}_0$. Hence, regardless of the value assigned to $\mathbf{w}(0)$, the single-layer perceptron of Fig. 4.1 is assured of convergence.

Summary

In Table 4.1 we present a summary of the perceptron convergence algorithm (Lippmann, 1987). The symbol sgn(·), used in step 3 of the table for computing the actual response of the perceptron, stands for the signum function:

$$\mathrm{sgn}(v) = \begin{cases} +1 & \text{if } v > 0 \\ -1 & \text{if } v < 0 \end{cases} \qquad (4.22)$$

TABLE 4.1 Summary of the Perceptron Convergence Algorithm

Variables and Parameters:
$\mathbf{x}(n)$ = $(p+1)$-by-1 input vector = $[-1, x_1(n), x_2(n), \ldots, x_p(n)]^T$
$\mathbf{w}(n)$ = $(p+1)$-by-1 weight vector = $[\theta(n), w_1(n), w_2(n), \ldots, w_p(n)]^T$
$\theta(n)$ = threshold
$y(n)$ = actual response (quantized)
$d(n)$ = desired response
$\eta$ = learning-rate parameter, a positive constant less than unity

Step 1: Initialization. Set $\mathbf{w}(0) = \mathbf{0}$. Then perform the following computations for time $n = 1, 2, \ldots$.

Step 2: Activation. At time $n$, activate the perceptron by applying the continuous-valued input vector $\mathbf{x}(n)$ and desired response $d(n)$.

Step 3: Computation of Actual Response. Compute the actual response of the perceptron:
$$y(n) = \mathrm{sgn}[\mathbf{w}^T(n)\mathbf{x}(n)]$$
where sgn(·) is the signum function.

Step 4: Adaptation of Weight Vector. Update the weight vector of the perceptron:
$$\mathbf{w}(n+1) = \mathbf{w}(n) + \eta[d(n) - y(n)]\mathbf{x}(n)$$
where
$$d(n) = \begin{cases} +1 & \text{if } \mathbf{x}(n) \text{ belongs to class } \mathcal{C}_1 \\ -1 & \text{if } \mathbf{x}(n) \text{ belongs to class } \mathcal{C}_2 \end{cases}$$

Step 5: Increment time $n$ by one unit, and go back to step 2.
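The steps of Table 4.1 translate directly into the following training loop, given here as a Python/NumPy sketch under the conventions of the table; the data set, learning rate, and epoch limit are illustrative choices, not part of the original text. The first element of each augmented input is fixed at -1, so the corresponding weight plays the role of the threshold.

```python
import numpy as np

def train_perceptron(X, d, eta=0.5, max_epochs=100):
    """Perceptron convergence algorithm, following Table 4.1.

    X : (N, p) array of continuous-valued input vectors.
    d : (N,) array of desired responses, +1 for class C1 and -1 for class C2.
    """
    N, p = X.shape
    Xa = np.hstack([-np.ones((N, 1)), X])    # augmented inputs [-1, x1, ..., xp], Eq. (4.3)
    w = np.zeros(p + 1)                      # step 1: w(0) = 0; w[0] acts as the threshold
    for _ in range(max_epochs):
        errors = 0
        for x, target in zip(Xa, d):         # steps 2-5: one pass over the training set
            y = 1 if w @ x >= 0 else -1      # step 3: y(n) = sgn[w^T(n) x(n)]
            w = w + eta * (target - y) * x   # step 4: error-correction update
            errors += int(y != target)
        if errors == 0:                      # every pattern classified correctly: stop
            break
    return w

# Illustrative linearly separable data (not from the text).
X = np.array([[2.0, 2.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
d = np.array([1, 1, -1, -1])
print(train_perceptron(X, d))
```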

In effect, the signum function is a shorthand notation for the asymmetric form of hard limiting. Accordingly, the quantized response $y(n)$ of the perceptron is written in the compact form

$$y(n) = \mathrm{sgn}[\mathbf{w}^T(n)\mathbf{x}(n)] \qquad (4.23)$$

Note that the input vector $\mathbf{x}(n)$ is a $(p+1)$-by-1 vector whose first element is fixed at -1 throughout the computation, in accordance with the definition of Eq. (4.3). Correspondingly, the weight vector $\mathbf{w}(n)$ is a $(p+1)$-by-1 vector whose first element equals the threshold $\theta(n)$, as defined in Eq. (4.4). In Table 4.1 we have also introduced a quantized desired response $d(n)$, defined by

$$d(n) = \begin{cases} +1 & \text{if } \mathbf{x}(n) \text{ belongs to class } \mathcal{C}_1 \\ -1 & \text{if } \mathbf{x}(n) \text{ belongs to class } \mathcal{C}_2 \end{cases} \qquad (4.24)$$

Thus, the adaptation of the weight vector $\mathbf{w}(n)$ is summed up nicely in the form of an error-correction learning rule, as shown by

$$\mathbf{w}(n+1) = \mathbf{w}(n) + \eta[d(n) - y(n)]\mathbf{x}(n) \qquad (4.25)$$

where $\eta$ is the learning-rate parameter, and the difference $d(n) - y(n)$ plays the role of an error signal. The learning-rate parameter is a positive constant limited to the range $0 < \eta \le 1$. In assigning a value to it inside this range, we have to keep in mind two conflicting requirements (Lippmann, 1987):

- Averaging of past inputs to provide stable weight estimates, which requires a small $\eta$
- Fast adaptation with respect to real changes in the underlying distributions of the process responsible for the generation of the input vector $\mathbf{x}$, which requires a large $\eta$

4.4 Performance Measure

In the previous section we derived the perceptron convergence algorithm without any reference to a performance measure, yet in Chapter 2 we emphasized that improvement in the performance of a learning machine takes place over time in accordance with a performance measure. What, then, is a suitable performance measure for the single-layer perceptron? An obvious choice would be the average probability of classification error, defined as the average probability of the perceptron making a decision in favor of a particular class when the input vector is drawn from the other class. Unfortunately, such a performance measure does not lend itself readily to the analytic derivation of a learning algorithm.

Shynk (1990) has proposed the use of a performance function that befits the operation of a perceptron, as shown by the statistical expectation

$$J = -E[e(n)v(n)] \qquad (4.26)$$

where $e(n)$ is the error signal defined as the difference between the desired response $d(n)$ and the actual response $y(n)$ of the perceptron, and $v(n)$ is the linear combiner output (i.e., hard-limiter input). Note that the error signal $e(n)$ is not a quantized signal, but rather the difference between two quantized signals. The instantaneous estimate of the performance function is the time-varying function

$$\hat{J}(n) = -e(n)v(n) = -[d(n) - y(n)]v(n) \qquad (4.27)$$

The instantaneous gradient vector is defined as the derivative of the estimate $\hat{J}(n)$ with respect to the weight vector $\mathbf{w}(n)$, as shown by

$$\nabla_{\mathbf{w}}\hat{J}(n) = \frac{\partial \hat{J}(n)}{\partial \mathbf{w}(n)} \qquad (4.28)$$

Hence, recognizing that the desired response $d(n)$ is independent of $\mathbf{w}(n)$ and the fact that the actual response $y(n)$ has a constant value equal to -1 or +1, we find that the use of Eq. (4.27) in (4.28) yields

$$\nabla_{\mathbf{w}}\hat{J}(n) = -[d(n) - y(n)]\frac{\partial v(n)}{\partial \mathbf{w}(n)} \qquad (4.29)$$

From the defining equation (4.5), we readily find that

$$\frac{\partial v(n)}{\partial \mathbf{w}(n)} = \mathbf{x}(n) \qquad (4.30)$$

Thus the use of Eq. (4.30) in (4.29) yields the result

$$\nabla_{\mathbf{w}}\hat{J}(n) = -[d(n) - y(n)]\mathbf{x}(n) \qquad (4.31)$$

Hence, according to the error-correction rule described in Chapter 2, we may now express the change applied to the weight vector as

$$\Delta\mathbf{w}(n) = -\eta\nabla_{\mathbf{w}}\hat{J}(n) = \eta[d(n) - y(n)]\mathbf{x}(n) \qquad (4.32)$$

where $\eta$ is the learning-rate parameter. This is precisely the correction made to the weight vector as we progress from iteration $n$ to $n+1$, as described in Eq. (4.25). To be precise, we should also include a Dirac delta function (i.e., impulse function) whenever the perceptron output $y(n)$ goes through a transition at $v(n) = 0$. Averaged over a long sequence of iterations, however, such a transition has a negligible effect. We have thus demonstrated that the instantaneous performance function $\hat{J}(n)$ defined in Eq. (4.27) is indeed the correct performance measure for a single-layer perceptron (Shynk, 1990).

To justify Eq. (4.30), consider a scalar quantity $v$ defined as the inner product of two real-valued $p$-by-1 vectors $\mathbf{w}$ and $\mathbf{x}$, as shown by

$$v = \mathbf{w}^T\mathbf{x} = \sum_{i=1}^{p} w_i x_i$$

where $w_i$ and $x_i$ are the $i$th elements of the vectors $\mathbf{w}$ and $\mathbf{x}$, respectively. The partial derivative of the quantity $v$ with respect to the vector $\mathbf{w}$ is defined by (Haykin, 1991)

$$\frac{\partial v}{\partial \mathbf{w}} = \left[\frac{\partial v}{\partial w_1}, \frac{\partial v}{\partial w_2}, \ldots, \frac{\partial v}{\partial w_p}\right]^T$$

From the given definition of $v$, we note that

$$\frac{\partial v}{\partial w_i} = x_i, \quad i = 1, 2, \ldots, p$$

We therefore have $\partial v/\partial \mathbf{w} = \mathbf{x}$.
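As a numerical companion to Eqs. (4.27)-(4.32), the short sketch below (illustrative Python, not part of the text) evaluates the instantaneous performance estimate, its gradient, and the resulting weight correction, which coincides with the update of Eq. (4.25):

```python
import numpy as np

def instantaneous_gradient(w, x, d, eta=0.5):
    """Instantaneous performance estimate and weight correction, Eqs. (4.27)-(4.32)."""
    v = w @ x                     # linear combiner output, Eq. (4.5)
    y = 1 if v >= 0 else -1       # quantized actual response
    J_hat = -(d - y) * v          # Eq. (4.27): J^(n) = -[d(n) - y(n)] v(n)
    grad = -(d - y) * x           # Eq. (4.31): instantaneous gradient vector
    delta_w = -eta * grad         # Eq. (4.32): eta * [d(n) - y(n)] * x(n)
    return J_hat, grad, delta_w
```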

4.5 Maximum-Likelihood Gaussian Classifier

A single-layer perceptron bears a close relationship to a classical pattern classifier known as the maximum-likelihood Gaussian classifier, in that they are both examples of linear classifiers (Duda and Hart, 1973). The maximum-likelihood method is a classical parameter-estimation procedure that views the parameters as quantities whose values are fixed but unknown. The best estimate is defined to be the one that maximizes the probability of obtaining the samples actually observed.

To be specific, suppose that we separate a set of samples according to class, so that we have $M$ subsets of samples denoted by $X_1, X_2, \ldots, X_M$. Let the samples in subset $X_j$ be drawn independently according to a probability law of known parametric form. Let this probability law be described by the conditional probability density function $f(\mathbf{x}|\mathbf{w}_j)$, where $\mathbf{x}$ is an observation vector and $\mathbf{w}_j$ represents an unknown parameter vector associated with class $\mathcal{C}_j$, $j = 1, 2, \ldots, M$. We further assume that the samples in $X_i$ give no information about $\mathbf{w}_j$ if $i \ne j$; that is, the parameters for the different classes are functionally independent. Now, viewed as a function of $\mathbf{w}$, $f(\mathbf{x}|\mathbf{w})$ is called the likelihood of $\mathbf{w}$ with respect to the observation vector $\mathbf{x}$. The maximum-likelihood estimate of $\mathbf{w}$ is, by definition, that value $\hat{\mathbf{w}}$ which maximizes the likelihood $f(\mathbf{x}|\mathbf{w})$.

To proceed with the issue at hand, consider a $p$-by-1 observation vector $\mathbf{x}$ defined by

$$\mathbf{x} = [x_1, x_2, \ldots, x_p]^T \qquad (4.33)$$

Note that the vector $\mathbf{x}$ as defined here is confined exclusively to the externally applied inputs. The observation vector $\mathbf{x}$ is described in terms of a mean vector $\boldsymbol{\mu}$ and covariance matrix $\mathbf{C}$, which are defined by, respectively,

$$\boldsymbol{\mu} = E[\mathbf{x}] \qquad (4.34)$$

and

$$\mathbf{C} = E[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^T] \qquad (4.35)$$

where $E$ is the expectation operator. Assuming that the vector $\mathbf{x}$ is Gaussian-distributed, we may express the joint probability density function of the elements of $\mathbf{x}$ as follows (Van Trees, 1968):

$$f(\mathbf{x}) = \frac{1}{(2\pi)^{p/2}(\det \mathbf{C})^{1/2}} \exp\left[-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T\mathbf{C}^{-1}(\mathbf{x} - \boldsymbol{\mu})\right] \qquad (4.36)$$

where $\det \mathbf{C}$ is the determinant of the covariance matrix $\mathbf{C}$, the matrix $\mathbf{C}^{-1}$ is the inverse of $\mathbf{C}$, and $p$ is the dimension of the vector $\mathbf{x}$.

To be more specific, suppose that we have a two-class problem, with the vector $\mathbf{x}$ characterized as follows, depending on whether it belongs to class $\mathcal{C}_1$ or class $\mathcal{C}_2$:

$\mathbf{x} \in \mathcal{C}_1$: mean vector $= \boldsymbol{\mu}_1$, covariance matrix $= \mathbf{C}$
$\mathbf{x} \in \mathcal{C}_2$: mean vector $= \boldsymbol{\mu}_2$, covariance matrix $= \mathbf{C}$

In other words, the vector $\mathbf{x}$ has a mean vector whose value depends on whether it belongs to class $\mathcal{C}_1$ or class $\mathcal{C}_2$, and a covariance matrix that is the same for both classes. It is further assumed that:

- The classes $\mathcal{C}_1$ and $\mathcal{C}_2$ are equiprobable.
- The samples of class $\mathcal{C}_1$ or $\mathcal{C}_2$ are correlated, so that the covariance matrix $\mathbf{C}$ is nondiagonal.
- The covariance matrix $\mathbf{C}$ is nonsingular, so that the inverse matrix $\mathbf{C}^{-1}$ exists.

We may then use Eq. (4.36) to express the joint probability density function of the input vector $\mathbf{x}$ as follows:

$$f(\mathbf{x}|\mathcal{C}_i) = \frac{1}{(2\pi)^{p/2}(\det \mathbf{C})^{1/2}} \exp\left[-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_i)^T\mathbf{C}^{-1}(\mathbf{x} - \boldsymbol{\mu}_i)\right] \qquad (4.37)$$

where the index $i = 1, 2$. Taking the natural logarithm of both sides of Eq. (4.37) and then expanding terms, we get

$$\ln f(\mathbf{x}|\mathcal{C}_i) = -\frac{p}{2}\ln(2\pi) - \frac{1}{2}\ln(\det \mathbf{C}) - \frac{1}{2}\mathbf{x}^T\mathbf{C}^{-1}\mathbf{x} + \boldsymbol{\mu}_i^T\mathbf{C}^{-1}\mathbf{x} - \frac{1}{2}\boldsymbol{\mu}_i^T\mathbf{C}^{-1}\boldsymbol{\mu}_i \qquad (4.38)$$

The first three terms on the right-hand side of Eq. (4.38) are independent of the index $i$. Hence, for the purpose of pattern classification considered here, we may define a log-likelihood as follows:

$$l_i(\mathbf{x}) = \boldsymbol{\mu}_i^T\mathbf{C}^{-1}\mathbf{x} - \frac{1}{2}\boldsymbol{\mu}_i^T\mathbf{C}^{-1}\boldsymbol{\mu}_i \qquad (4.39)$$

where $i = 1, 2$. For class $\mathcal{C}_1$, the log-likelihood is

$$l_1(\mathbf{x}) = \boldsymbol{\mu}_1^T\mathbf{C}^{-1}\mathbf{x} - \frac{1}{2}\boldsymbol{\mu}_1^T\mathbf{C}^{-1}\boldsymbol{\mu}_1 \qquad (4.40)$$

For class $\mathcal{C}_2$, the log-likelihood is

$$l_2(\mathbf{x}) = \boldsymbol{\mu}_2^T\mathbf{C}^{-1}\mathbf{x} - \frac{1}{2}\boldsymbol{\mu}_2^T\mathbf{C}^{-1}\boldsymbol{\mu}_2 \qquad (4.41)$$

Hence, subtracting Eq. (4.41) from (4.40), we get

$$l = l_1(\mathbf{x}) - l_2(\mathbf{x}) = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^T\mathbf{C}^{-1}\mathbf{x} - \frac{1}{2}(\boldsymbol{\mu}_1^T\mathbf{C}^{-1}\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2^T\mathbf{C}^{-1}\boldsymbol{\mu}_2) \qquad (4.42)$$

which is linearly related to the input vector $\mathbf{x}$. To emphasize this property of the composite log-likelihood $l$, we rewrite Eq. (4.42) in the equivalent form

$$l = \hat{\mathbf{w}}^T\mathbf{x} - \theta = \sum_{i=1}^{p} \hat{w}_i x_i - \theta \qquad (4.43)$$

where $\hat{\mathbf{w}}$ is the maximum-likelihood estimate of the unknown parameter vector $\mathbf{w}$, defined by

$$\hat{\mathbf{w}} = \mathbf{C}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) \qquad (4.44)$$

and $\theta$ is a constant threshold, defined by

$$\theta = \frac{1}{2}(\boldsymbol{\mu}_1^T\mathbf{C}^{-1}\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2^T\mathbf{C}^{-1}\boldsymbol{\mu}_2) \qquad (4.45)$$

The maximum-likelihood Gaussian classifier, or Gaussian classifier for short, may thus be implemented by a linear combiner as shown in the signal-flow graph of Fig. 4.5.

FIGURE 4.5 Signal-flow graph of the Gaussian classifier.

Equations (4.44) and (4.45) simplify considerably for the following special case: all the samples of the vector $\mathbf{x}$ are uncorrelated with each other, and also have a common variance $\sigma^2$, regardless of the associated class; that is, the covariance matrix $\mathbf{C}$ is a diagonal matrix defined by

$$\mathbf{C} = \sigma^2\mathbf{I} \qquad (4.46)$$

where $\mathbf{I}$ is the $p$-by-$p$ identity matrix. We now have

$$\mathbf{C}^{-1} = \frac{1}{\sigma^2}\mathbf{I} \qquad (4.47)$$

Accordingly, Eqs. (4.44) and (4.45) simplify as, respectively,

$$\hat{\mathbf{w}} = \frac{1}{\sigma^2}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) \qquad (4.48)$$

and

$$\theta = \frac{1}{2\sigma^2}(\|\boldsymbol{\mu}_1\|^2 - \|\boldsymbol{\mu}_2\|^2) \qquad (4.49)$$

where $\|\cdot\|$ denotes the Euclidean norm of the enclosed vector.

In any case, treating $l$ as the overall classifier output, the Gaussian classifier, characterized by the parameter vector $\hat{\mathbf{w}}$ and threshold $\theta$, solves the pattern-classification problem in accordance with the following rule:

If $l \ge 0$, then $l_1 \ge l_2$, and therefore assign $\mathbf{x}$ to class $\mathcal{C}_1$.
If $l < 0$, then $l_2 > l_1$, and therefore assign $\mathbf{x}$ to class $\mathcal{C}_2$.    (4.50)
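One way to realize Eqs. (4.43)-(4.45) and the rule of Eq. (4.50) in code is sketched below (illustrative Python; the pooled sample estimates of the means and of the common covariance matrix are an assumption added here, since the text works with the true parameters):

```python
import numpy as np

def gaussian_classifier(X1, X2):
    """Weight vector and threshold of the maximum-likelihood Gaussian classifier.

    X1, X2 : (N1, p) and (N2, p) arrays of samples from classes C1 and C2.
    A common covariance matrix C is estimated by pooling the two classes.
    """
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    centered = np.vstack([X1 - mu1, X2 - mu2])
    C = centered.T @ centered / (len(centered) - 2)        # pooled covariance estimate
    C_inv = np.linalg.inv(C)                               # C assumed nonsingular
    w_hat = C_inv @ (mu1 - mu2)                            # Eq. (4.44)
    theta = 0.5 * (mu1 @ C_inv @ mu1 - mu2 @ C_inv @ mu2)  # Eq. (4.45)
    return w_hat, theta

def classify(x, w_hat, theta):
    """Decision rule of Eq. (4.50): l = w_hat^T x - theta; class C1 if l >= 0, else C2."""
    return 1 if w_hat @ x - theta >= 0 else 2
```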

The operation of the Gaussian classifier is analogous to that of the single-layer perceptron in that they are both linear classifiers; see Eqs. (4.2) and (4.43). There are, however, some subtle and important differences between them, which should be carefully noted, as discussed here:

- The single-layer perceptron operates on the premise that the patterns to be classified are linearly separable. The Gaussian distributions of the two patterns assumed in the derivation of the maximum-likelihood Gaussian classifier certainly overlap each other and are therefore not exactly separable; the extent of the overlap is determined by the mean vectors $\boldsymbol{\mu}_1$ and $\boldsymbol{\mu}_2$ and the covariance matrices $\mathbf{C}_1$ and $\mathbf{C}_2$. The nature of this overlap is illustrated in Fig. 4.6 for the special case of a scalar input $x$ (i.e., input dimension $p = 1$). When the inputs are nonseparable and their distributions overlap as described here, the perceptron convergence algorithm develops a problem in that the decision boundaries between the different classes may oscillate continuously (Lippmann, 1987).

- The maximum-likelihood Gaussian classifier minimizes the average probability of classification error. This minimization is independent of the overlap between the underlying Gaussian distributions of the two classes. For example, in the special case illustrated in Fig. 4.6, the maximum-likelihood Gaussian classifier always positions the decision boundary at the point where the two Gaussian distributions cross each other (see the short sketch following this list). This positioning of the decision boundary is perfectly in accord with the optimum Bayes hypothesis-testing procedure on the assumption that the two classes are equiprobable (i.e., there is no prior information to be incorporated into the decision making).

- The perceptron convergence algorithm is nonparametric in the sense that it makes no assumptions concerning the form of the underlying distributions; it operates by concentrating on errors that occur where the distributions overlap. It may thus be more robust than classical techniques, and work well when the inputs are generated by nonlinear physical mechanisms whose distributions are heavily skewed and non-Gaussian (Lippmann, 1987). In contrast, the maximum-likelihood Gaussian classifier is parametric; its derivation is contingent on the assumption that the underlying distributions are Gaussian, which may limit its area of application.

- The perceptron convergence algorithm is both adaptive and simple to implement; its storage requirement is confined to the set of synaptic weights and threshold. On the other hand, the design of the maximum-likelihood Gaussian classifier is fixed; it can be made adaptive, but at the expense of increased storage requirements and more complex computations (Lippmann, 1987).

FIGURE 4.6 Two overlapping, one-dimensional Gaussian distributions, with the decision boundary between them.
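For two equal-variance, equiprobable one-dimensional Gaussians as in Fig. 4.6, the crossing point of the two densities (and hence the boundary placed by the Gaussian classifier) lies midway between the two means. The short sketch below (Python, with arbitrary illustrative values) verifies this numerically:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of a one-dimensional Gaussian with mean mu and standard deviation sigma."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

mu1, mu2, sigma = -1.0, 2.0, 1.0                 # illustrative class means and common std
x = np.linspace(mu1, mu2, 10001)                 # search between the two means
crossing = x[np.argmin(np.abs(gaussian_pdf(x, mu1, sigma) - gaussian_pdf(x, mu2, sigma)))]
print(crossing, (mu1 + mu2) / 2)                 # both approximately 0.5
```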

4.6 Discussion

The study of the single-layer perceptron presented in this chapter has been formulated in the context of the McCulloch-Pitts formal model of a neuron. The nonlinear element of this model is represented by a hard limiter at its output end. It would be tempting to think that we can do better if we were to use a sigmoidal nonlinear element in place of the hard limiter. Well, it turns out that the steady-state, decision-making characteristics of a single-layer perceptron are basically the same, regardless of whether we use hard limiting or soft limiting as the source of nonlinearity in the neural model (Shynk and Bershad, 1992; Shynk, 1990). We may therefore state formally that so long as we limit ourselves to the model of a neuron that consists of a linear combiner followed by a nonlinear element, then regardless of the form of nonlinearity used, a single-layer perceptron can perform pattern classification only on linearly separable patterns.

Linear separability requires that the patterns to be classified be sufficiently separated from each other to ensure that the decision surfaces consist of hyperplanes. This requirement is illustrated in Fig. 4.7a for the case of a two-dimensional, single-layer perceptron. If the two patterns $\mathcal{C}_1$ and $\mathcal{C}_2$ are now allowed to move too close to each other, as in Fig. 4.7b, they become nonlinearly separable and the elementary perceptron fails to classify them.

FIGURE 4.7 (a) A pair of linearly separable patterns. (b) A pair of nonlinearly separable patterns.

The first real critique of Rosenblatt's perceptron was presented by Minsky and Selfridge (1961). Minsky and Selfridge pointed out that the perceptron as defined by Rosenblatt could not even generalize toward the notion of binary parity, let alone make general abstractions. They also suggested the roles that connectionist networks might play in the implementation of larger systems. The computational limitations of Rosenblatt's perceptron were subsequently put on a solid mathematical foundation in the famous book by Minsky and Papert (1969, 1988). This is a thorough and well-written book. After the presentation of some brilliant and highly detailed mathematical analysis of the perceptron, Minsky and Papert proved that the perceptron as defined by Rosenblatt is inherently incapable of making some global generalizations on the basis of locally learned examples. In the last chapter of their book, Minsky and Papert go on to make the conjecture that the limitations of the kind they had discovered for Rosenblatt's perceptron would also hold true for its variants, more specifically, multilayer neural networks. Quoting from Section 13.2 of the book by Minsky and Papert (1969):

The perceptron has shown itself worthy of study despite (and even because of!) its severe limitations. It has many features to attract attention: its linearity; its intriguing learning theorem; its clear paradigmatic simplicity as a kind of parallel computation. There is no reason to suppose that any of these virtues carry over to the many-layered version. Nevertheless, we consider it to be an important research problem to elucidate (or reject) our intuitive judgement that the extension to multilayer systems is sterile.

This conjecture cast serious doubt on the computational capabilities of not only the perceptron but neural networks in general up to the mid-1980s. History has shown, however, that the conjecture made by Minsky and Papert seems to be unjustified, in that we now have several advanced forms of neural networks that are computationally more powerful than Rosenblatt's perceptron. For example, multilayer perceptrons trained with the back-propagation algorithm discussed in Chapter 6 and the radial basis-function networks discussed in Chapter 7 overcome the computational limitations of the single-layer perceptron.

PROBLEMS

Verify that Eqs. (4.22)-(4.25), summarizing the perceptron convergence algorithm, are consistent with Eqs. (4.7) and (4.8).

The single-layer perceptron is a linear pattern classifier. Justify this statement.

Consider two one-dimensional, Gaussian-distributed classes $\mathcal{C}_1$ and $\mathcal{C}_2$ that have a common variance equal to 1. Their mean values are $\mu_1 = -10$ and $\mu_2 = +10$. These two classes are essentially linearly separable. Design a classifier that separates these two classes.

Equations (4.44) and (4.45) define the weight vector and threshold of a maximum-likelihood Gaussian classifier, assuming that the samples of classes $\mathcal{C}_1$ or $\mathcal{C}_2$ are correlated. Suppose now that these samples are uncorrelated but have different variances, so that for both classes we may write $\mathbf{C} = \mathrm{diag}[\sigma_1^2, \sigma_2^2, \ldots, \sigma_p^2]$. Find the weight vector and threshold of the classifier so defined.

In Eq. (4.27) we defined one form of an instantaneous performance function for a single-layer perceptron. Another way of expressing this performance measure is as follows (Shynk, 1990): $\hat{J}(n) = |v(n)| - d(n)v(n)$, where $v(n)$ is the hard-limiter input and $d(n)$ is the desired response. Using this performance measure, find the instantaneous gradient vector $\nabla_{\mathbf{w}}\hat{J}(n)$, and show that the expression so derived is identical with that in Eq. (4.31).

The perceptron may be used to perform numerous logic functions. Demonstrate the implementation of the binary logic functions AND, OR, and COMPLEMENT. A basic limitation of the perceptron, however, is that it cannot implement the EXCLUSIVE OR function. Explain the reason for this limitation.


More information

The Perceptron algorithm

The Perceptron algorithm The Perceptron algorithm Tirgul 3 November 2016 Agnostic PAC Learnability A hypothesis class H is agnostic PAC learnable if there exists a function m H : 0,1 2 N and a learning algorithm with the following

More information

Mining Classification Knowledge

Mining Classification Knowledge Mining Classification Knowledge Remarks on NonSymbolic Methods JERZY STEFANOWSKI Institute of Computing Sciences, Poznań University of Technology COST Doctoral School, Troina 2008 Outline 1. Bayesian classification

More information

Artificial Neural Networks Lecture Notes Part 3

Artificial Neural Networks Lecture Notes Part 3 Artificial Neural Networks Lecture Notes Part 3 About this file: This is the printerfriendly version of the file "lecture03.htm". If you have trouble reading the contents of this file, or in case of transcription

More information

Notes on Discriminant Functions and Optimal Classification

Notes on Discriminant Functions and Optimal Classification Notes on Discriminant Functions and Optimal Classification Padhraic Smyth, Department of Computer Science University of California, Irvine c 2017 1 Discriminant Functions Consider a classification problem

More information

Lecture 6. Regression

Lecture 6. Regression Lecture 6. Regression Prof. Alan Yuille Summer 2014 Outline 1. Introduction to Regression 2. Binary Regression 3. Linear Regression; Polynomial Regression 4. Non-linear Regression; Multilayer Perceptron

More information

Principles of Artificial Intelligence Fall 2005 Handout #7 Perceptrons

Principles of Artificial Intelligence Fall 2005 Handout #7 Perceptrons Principles of Artificial Intelligence Fall 2005 Handout #7 Perceptrons Vasant Honavar Artificial Intelligence Research Laboratory Department of Computer Science 226 Atanasoff Hall Iowa State University

More information

Single Layer Perceptron Networks

Single Layer Perceptron Networks Single Layer Perceptron Networks We have looked at what artificial neural networks (ANNs) can do, and by looking at their history have seen some of the different types of neural network. We started looking

More information

CSC Neural Networks. Perceptron Learning Rule

CSC Neural Networks. Perceptron Learning Rule CSC 302 1.5 Neural Networks Perceptron Learning Rule 1 Objectives Determining the weight matrix and bias for perceptron networks with many inputs. Explaining what a learning rule is. Developing the perceptron

More information

Reification of Boolean Logic

Reification of Boolean Logic 526 U1180 neural networks 1 Chapter 1 Reification of Boolean Logic The modern era of neural networks began with the pioneer work of McCulloch and Pitts (1943). McCulloch was a psychiatrist and neuroanatomist;

More information

Artificial Neural Networks

Artificial Neural Networks Artificial Neural Networks Math 596 Project Summary Fall 2016 Jarod Hart 1 Overview Artificial Neural Networks (ANNs) are machine learning algorithms that imitate neural function. There is a vast theory

More information

Neural networks. Chapter 20. Chapter 20 1

Neural networks. Chapter 20. Chapter 20 1 Neural networks Chapter 20 Chapter 20 1 Outline Brains Neural networks Perceptrons Multilayer networks Applications of neural networks Chapter 20 2 Brains 10 11 neurons of > 20 types, 10 14 synapses, 1ms

More information

Input layer. Weight matrix [ ] Output layer

Input layer. Weight matrix [ ] Output layer MASSACHUSETTS INSTITUTE OF TECHNOLOGY Department of Electrical Engineering and Computer Science 6.034 Artificial Intelligence, Fall 2003 Recitation 10, November 4 th & 5 th 2003 Learning by perceptrons

More information

The Perceptron Algorithm 1

The Perceptron Algorithm 1 CS 64: Machine Learning Spring 5 College of Computer and Information Science Northeastern University Lecture 5 March, 6 Instructor: Bilal Ahmed Scribe: Bilal Ahmed & Virgil Pavlu Introduction The Perceptron

More information

Serious limitations of (single-layer) perceptrons: Cannot learn non-linearly separable tasks. Cannot approximate (learn) non-linear functions

Serious limitations of (single-layer) perceptrons: Cannot learn non-linearly separable tasks. Cannot approximate (learn) non-linear functions BACK-PROPAGATION NETWORKS Serious limitations of (single-layer) perceptrons: Cannot learn non-linearly separable tasks Cannot approximate (learn) non-linear functions Difficult (if not impossible) to design

More information

Linear models: the perceptron and closest centroid algorithms. D = {(x i,y i )} n i=1. x i 2 R d 9/3/13. Preliminaries. Chapter 1, 7.

Linear models: the perceptron and closest centroid algorithms. D = {(x i,y i )} n i=1. x i 2 R d 9/3/13. Preliminaries. Chapter 1, 7. Preliminaries Linear models: the perceptron and closest centroid algorithms Chapter 1, 7 Definition: The Euclidean dot product beteen to vectors is the expression d T x = i x i The dot product is also

More information

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra.

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra. DS-GA 1002 Lecture notes 0 Fall 2016 Linear Algebra These notes provide a review of basic concepts in linear algebra. 1 Vector spaces You are no doubt familiar with vectors in R 2 or R 3, i.e. [ ] 1.1

More information

Data Mining Part 5. Prediction

Data Mining Part 5. Prediction Data Mining Part 5. Prediction 5.5. Spring 2010 Instructor: Dr. Masoud Yaghini Outline How the Brain Works Artificial Neural Networks Simple Computing Elements Feed-Forward Networks Perceptrons (Single-layer,

More information

9 Classification. 9.1 Linear Classifiers

9 Classification. 9.1 Linear Classifiers 9 Classification This topic returns to prediction. Unlike linear regression where we were predicting a numeric value, in this case we are predicting a class: winner or loser, yes or no, rich or poor, positive

More information

Structure learning in human causal induction

Structure learning in human causal induction Structure learning in human causal induction Joshua B. Tenenbaum & Thomas L. Griffiths Department of Psychology Stanford University, Stanford, CA 94305 jbt,gruffydd @psych.stanford.edu Abstract We use

More information

DEEP LEARNING CHAPTER 3 PROBABILITY & INFORMATION THEORY

DEEP LEARNING CHAPTER 3 PROBABILITY & INFORMATION THEORY DEEP LEARNING CHAPTER 3 PROBABILITY & INFORMATION THEORY OUTLINE 3.1 Why Probability? 3.2 Random Variables 3.3 Probability Distributions 3.4 Marginal Probability 3.5 Conditional Probability 3.6 The Chain

More information

Lecture 4: Feed Forward Neural Networks

Lecture 4: Feed Forward Neural Networks Lecture 4: Feed Forward Neural Networks Dr. Roman V Belavkin Middlesex University BIS4435 Biological neurons and the brain A Model of A Single Neuron Neurons as data-driven models Neural Networks Training

More information