
Notes and Solutions for: Pattern Recognition by Sergios Theodoridis and Konstantinos Koutroumbas

John L. Weatherwax

January 9, 2006

Introduction

Here you'll find some notes that I wrote up as I worked through this excellent book. I've worked hard to make these notes as good as I can, but I have no illusions that they are perfect. If you feel that there is a better way to accomplish or explain an exercise or derivation presented in these notes, or that one or more of the explanations is unclear, incomplete, or misleading, please tell me. If you find an error of any kind (technical, grammatical, typographical, whatever) please tell me that, too. I'll gladly add to the acknowledgments in later printings the name of the first person to bring each problem to my attention.

Acknowledgments

Special thanks to (most recent comments are listed first): Karonny F and Mohammad Heshajin for helping improve these notes and solutions.

All comments (no matter how small) are much appreciated. In fact, if you find these notes useful I would appreciate a contribution in the form of a solution to a problem that I did not work, a mathematical derivation of a statement or comment made in the book that was unclear, a piece of code that implements one of the algorithms discussed, or a correction to a typo (spelling, grammar, etc.) in these notes. Sort of a take a penny, leave a penny type of approach. Remember: pay it forward.

wax@alum.mit.edu

Classifiers Based on Bayes Decision Theory

Notes on the text

Minimizing the average risk

The symbol r_k is the expected risk associated with observing an object from class k. This risk is divided up into parts that depend on what we then do when an object from class k with feature vector x is observed. Now we only observe the feature vector x and not the true class label k. Since we must still perform an action when we observe x, let λ_{ki} represent the loss associated with the event that the object is truly from class k and we decide that it is from class i. Define r_k as the expected loss when an object of type k is presented to us. Then

r_k = Σ_{i=1}^M λ_{ki} P(we classify this object as a member of class i)
    = Σ_{i=1}^M λ_{ki} ∫_{R_i} p(x|ω_k) dx,

which is the book's equation 2.14. Thus the total risk r is the expected value of the class dependent risks r_k, taking into account how likely each class is, or

r = Σ_{k=1}^M r_k P(ω_k)
  = Σ_{k=1}^M Σ_{i=1}^M λ_{ki} ∫_{R_i} p(x|ω_k) P(ω_k) dx
  = Σ_{i=1}^M ∫_{R_i} ( Σ_{k=1}^M λ_{ki} p(x|ω_k) P(ω_k) ) dx.   (1)

The decision rule that leads to the smallest total risk is obtained by selecting R_i to be the region of feature space in which the integrand above is as small as possible. That is, R_i should be defined as the values of x such that for that value of i we have

Σ_{k=1}^M λ_{ki} p(x|ω_k) P(ω_k) < Σ_{k=1}^M λ_{kj} p(x|ω_k) P(ω_k)   for all j ≠ i.

In words, the index i, when put in the sum above, gives the smallest value when compared to all other possible choices. For these values of x we should select class ω_i as our classification decision.
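To make the rule concrete, here is a minimal MATLAB sketch (mine, not from the book or the original notes) that evaluates the minimum-risk rule at a single point for M = 2 classes; the Gaussian class conditionals, priors, and loss matrix are illustrative assumptions.

  % Minimum-risk classification at a feature value x (illustrative values).
  gauss = @(x, m, s) exp(-(x - m).^2/(2*s^2))/(sqrt(2*pi)*s);  % N(m, s^2) pdf
  x = 0.7;
  Pw = [0.5, 0.5];                          % priors P(w_1), P(w_2)
  lam = [0, 1.0; 0.5, 0];                   % loss matrix: lam(k,i) = lambda_{ki}
  pxw = [gauss(x, 0, 1), gauss(x, 1, 1)];   % p(x|w_1), p(x|w_2)

  % risk of deciding class i at x: sum_k lambda_{ki} p(x|w_k) P(w_k)
  risk = zeros(1, 2);
  for i = 1:2
    for k = 1:2
      risk(i) = risk(i) + lam(k, i)*pxw(k)*Pw(k);
    end
  end
  [~, decision] = min(risk);                % the rule above: pick the smallest integrand
  fprintf('decide class %d (risks %.4f and %.4f)\n', decision, risk(1), risk(2));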

Bayesian classification with normal distributions

When the covariance matrices for two classes are the same and diagonal, i.e. Σ_i = Σ_j = σ²I, then the discriminant functions g_ij(x) are given by

g_ij(x) = w^T(x − x_0) = (µ_i − µ_j)^T(x − x_0),   (2)

since the vector w is w = µ_i − µ_j in this case. Note that the point x_0 is on the decision hyperplane, i.e. satisfies g_ij(x_0) = 0, since g_ij(x_0) = w^T(x_0 − x_0) = 0. Let x be another point on the decision hyperplane; then x − x_0 is a vector in the decision hyperplane. Since x is a point on the decision hyperplane it also must satisfy g_ij(x) = 0. From the functional form for g_ij(·) and the definition of w, this means that

w^T(x − x_0) = (µ_i − µ_j)^T(x − x_0) = 0.

This is the statement that the line connecting µ_i and µ_j is orthogonal to the decision hyperplane. In the same way, when the covariance matrices of each class are not diagonal but are nevertheless equal, Σ_i = Σ_j = Σ, the same logic that we used above states that the decision hyperplane is again orthogonal to the vector w, which in this case is Σ^{-1}(µ_i − µ_j).

The magnitude of P(ω_i) relative to P(ω_j) influences how close the decision hyperplane is to the respective class means µ_i or µ_j, in the sense that the class with the larger a priori probability will have a larger region of R^l assigned to it for classification. For example, if P(ω_i) < P(ω_j) then ln(P(ω_i)/P(ω_j)) < 0, so the point x_0, which in the case Σ_i = Σ_j = Σ is given by

x_0 = ½(µ_i + µ_j) − ln( P(ω_i)/P(ω_j) ) (µ_i − µ_j)/‖µ_i − µ_j‖²_{Σ^{-1}},   (3)

can be written as x_0 = ½(µ_i + µ_j) + α(µ_i − µ_j), with the value of α > 0. Since µ_i − µ_j is a vector from µ_j to µ_i, the expression for x_0 above starts at the midpoint ½(µ_i + µ_j) and moves closer to µ_i. This means that the amount of R^l assigned to class ω_j is larger than the amount assigned to class ω_i. This is expected since the prior probability of class ω_j is larger than that of ω_i.

Notes on Example 2.2

To see the final lengths of the principal axes we start with the transformed equation of constant Mahalanobis distance d_m = √2.952, or

(x_1 − µ_1)²/λ_1 + (x_2 − µ_2)²/λ_2 = (√2.952)² = 2.952.

Since we want the principal axes about (0,0) we have µ_1 = µ_2 = 0, and λ_1 and λ_2 are the eigenvalues given by solving |Σ − λI| = 0. In this case we get λ_1 = 1 (in direction v_1) and λ_2 = 2 (in direction v_2). Then the above becomes, in standard form for a conic section,

x_1²/(2.952 λ_1) + x_2²/(2.952 λ_2) = 1.

From this expression we can read off the lengths of the principal axes as

2√(2.952 λ_1) = 2√2.952  and  2√(2.952 λ_2) = 2√(2.952(2)).

Maximum A Posteriori (MAP) Estimation: Example 2.4

We will derive the MAP estimate of the population mean µ when given N samples x_k distributed as p(x|µ) and a normal prior on µ, i.e. N(µ_0, σ_µ²I). Then the estimate of the population mean µ given the sample X ≡ {x_k}_{k=1}^N is proportional to

p(µ|X) ∝ p(µ) p(X|µ) = p(µ) ∏_{k=1}^N p(x_k|µ).

Note that we have written p(µ) on the outside of the product terms since it should only appear once, and not N times as might be inferred had we written the product as ∏_{k=1}^N p(µ) p(x_k|µ). To find the value of µ that maximizes this, we begin by taking the natural log of the expression above, then take the µ derivative and set the resulting expression equal to zero. We find the natural log of the above given by

ln(p(µ)) + Σ_{k=1}^N ln(p(x_k|µ)) = −(1/(2σ_µ²))‖µ − µ_0‖² − ½ Σ_{k=1}^N (x_k − µ)^T Σ^{-1} (x_k − µ).

Then taking the derivative with respect to µ, setting the result equal to zero, and calling that solution µ̂ gives

−(1/σ_µ²)(µ̂ − µ_0) + (1/σ²) Σ_{k=1}^N (x_k − µ̂) = 0,

where we have assumed that the density p(x|µ) is N(µ, Σ) with Σ = σ²I. When we solve for µ̂ in the above we get

µ̂ = ( µ_0/σ_µ² + (1/σ²) Σ_{k=1}^N x_k ) / ( 1/σ_µ² + N/σ² ) = ( µ_0 + (σ_µ²/σ²) Σ_{k=1}^N x_k ) / ( 1 + (σ_µ²/σ²) N ).   (4)

Maximum Entropy Estimation

As another method to determine distribution parameters, we seek to maximize the entropy H, or

H = −∫_X p(x) ln(p(x)) dx.   (5)

This is equivalent to minimizing its negative, ∫_X p(x) ln(p(x)) dx. To incorporate the constraint that the density must integrate to one, we form the entropy Lagrangian

H_L = −∫_{x_1}^{x_2} p(x) ln(p(x)) dx + λ( ∫_{x_1}^{x_2} p(x) dx − 1 ),

where we have assumed that our density is non-zero only over [x_1, x_2]. The negative of the above is equivalent to

−H_L = ∫_{x_1}^{x_2} p(x)(ln(p(x)) − λ) dx + λ.

Taking the p(x) derivative and setting it equal to zero gives

∂(−H_L)/∂p = ∫_{x_1}^{x_2} [ (ln(p) − λ) + p(1/p) ] dx = ∫_{x_1}^{x_2} [ ln(p) − λ + 1 ] dx = 0.

Solving for the integral of ln(p(x)) we get

∫_{x_1}^{x_2} ln(p(x)) dx = (λ − 1)(x_2 − x_1).

Taking the x_2 derivative of this expression we find

ln(p(x_2)) = λ − 1  ⇒  p(x_2) = e^{λ−1}.

To find the value of λ we put this expression into our constraint ∫_{x_1}^{x_2} p(x) dx = 1 to get

e^{λ−1}(x_2 − x_1) = 1,

or λ − 1 = ln( 1/(x_2 − x_1) ), thus

p(x) = exp{ ln( 1/(x_2 − x_1) ) } = 1/(x_2 − x_1),

a uniform distribution.

Problem Solutions

Problem 2.1 (the Bayes rule minimizes the probability of error)

Following the hint in the book, the probability of correct classification P_c is given by

P_c = Σ_{i=1}^M P(x ∈ R_i, ω_i),

since in order to be correct when x ∈ R_i the sample that generated x must come from the class ω_i. Now this joint probability is given by

P(x ∈ R_i, ω_i) = P(x ∈ R_i|ω_i) P(ω_i) = ( ∫_{R_i} p(x|ω_i) dx ) P(ω_i).

So the expression for P_c then becomes

P_c = Σ_{i=1}^M ( ∫_{R_i} p(x|ω_i) dx ) P(ω_i) = Σ_{i=1}^M ∫_{R_i} p(x|ω_i) P(ω_i) dx.   (6)

Since this is a sum of M different terms, to maximize P_c we will define R_i to be the region of x where

p(x|ω_i) P(ω_i) > p(x|ω_j) P(ω_j)   for all j ≠ i.   (7)

If we do this then, by Equation 7, in R_i the expression ∫_{R_i} p(x|ω_i) P(ω_i) dx will be as large as possible. As Equation 6 is the sum of such terms, we will have also maximized P_c. Now dividing both sides of Equation 7 by p(x) and using Bayes' rule, we have

P(ω_i|x) > P(ω_j|x)   for all j ≠ i,

as the multi-class decision boundary, which is what we were to show.

Problem 2.2 (finding the decision boundary)

Using the book's notation, where λ_{ki} is the loss associated with us deciding an object is from class i when it in fact is from class k, we need to compare the expressions given by Equation 1. Since this is a two class problem, M = 2, and we need to compare

l_1 = λ_{11} p(x|ω_1)P(ω_1) + λ_{21} p(x|ω_2)P(ω_2)
l_2 = λ_{12} p(x|ω_1)P(ω_1) + λ_{22} p(x|ω_2)P(ω_2).

Under zero-one loss these become

l_1 = λ_{21} p(x|ω_2)P(ω_2)
l_2 = λ_{12} p(x|ω_1)P(ω_1).

When l_1 < l_2 we will classify x as from class ω_1, and from class ω_2 otherwise. The decision boundary will be the point x_0 where l_1 = l_2. This latter equation (if we solve for the likelihood ratio) is

p(x|ω_1)/p(x|ω_2) = λ_{21} P(ω_2) / (λ_{12} P(ω_1)).   (8)

If we assume that p(x|ω_1) ~ N(0, σ²) and p(x|ω_2) ~ N(1, σ²) then

p(x|ω_1)/p(x|ω_2) = e^{−x²/(2σ²)} / e^{−(x−1)²/(2σ²)} = exp{ −(1/(2σ²))(2x − 1) }.

Setting this equal to λ_{21}P(ω_2)/(λ_{12}P(ω_1)) and solving for x gives

x_0 = ½ − σ² ln( λ_{21}P(ω_2) / (λ_{12}P(ω_1)) ),

as we were to show.
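As a quick numeric check of this threshold, the following MATLAB fragment (mine; the variance, priors, and losses are illustrative assumptions) verifies that the two scaled likelihoods agree at x_0:

  % Decision point for Problem 2.2 with assumed parameter values.
  sigma2 = 1.0;  P1 = 0.5;  P2 = 0.5;  lam12 = 1.0;  lam21 = 0.5;
  x0 = 1/2 - sigma2*log(lam21*P2/(lam12*P1));
  g = @(x, m) exp(-(x - m).^2/(2*sigma2))/sqrt(2*pi*sigma2);  % N(m, sigma^2) pdf
  % at x0 the two terms l_1 and l_2 should be equal:
  fprintf('x0 = %.4f, l1 = %.4f, l2 = %.4f\n', x0, lam21*g(x0, 1)*P2, lam12*g(x0, 0)*P1);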

Problem 2.3 (an expression for the average risk)

We are told to define the errors ε_1 and ε_2 as the class conditional errors, or

ε_1 = P(x ∈ R_2|ω_1) = ∫_{R_2} p(x|ω_1) dx
ε_2 = P(x ∈ R_1|ω_2) = ∫_{R_1} p(x|ω_2) dx.

Using these definitions we can manipulate the average risk as

r = P(ω_1)( λ_{11} ∫_{R_1} p(x|ω_1) dx + λ_{12} ∫_{R_2} p(x|ω_1) dx )
  + P(ω_2)( λ_{21} ∫_{R_1} p(x|ω_2) dx + λ_{22} ∫_{R_2} p(x|ω_2) dx )
  = P(ω_1)( λ_{11}(1 − ∫_{R_2} p(x|ω_1) dx) + λ_{12} ∫_{R_2} p(x|ω_1) dx )
  + P(ω_2)( λ_{21} ∫_{R_1} p(x|ω_2) dx + λ_{22}(1 − ∫_{R_1} p(x|ω_2) dx) )
  = λ_{11}P(ω_1) − λ_{11}ε_1P(ω_1) + λ_{12}ε_1P(ω_1) + λ_{21}ε_2P(ω_2) + λ_{22}P(ω_2) − λ_{22}ε_2P(ω_2)
  = λ_{11}P(ω_1) + λ_{22}P(ω_2) + (λ_{12} − λ_{11})ε_1P(ω_1) + (λ_{21} − λ_{22})ε_2P(ω_2),

resulting in the desired expression.

Problem 2.4 (bounding the probability of error)

We desire to show that P_e ≤ (M−1)/M. To do this, recall that since Σ_{i=1}^M P(ω_i|x) = 1, at least one P(ω_i|x) must be larger than or equal to 1/M, otherwise the sum Σ_{i=1}^M P(ω_i|x) would have to be less than one. Now let P(ω_{i*}|x) be the Bayes classification decision. That is,

P(ω_{i*}|x) = max_i P(ω_i|x).

From the above discussion, P(ω_{i*}|x) ≥ 1/M. From this we see that

P_e = 1 − max_i P(ω_i|x) ≤ 1 − 1/M = (M−1)/M,

the desired expression.
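A tiny MATLAB experiment (mine; the random posteriors are just an illustration) makes this bound tangible: no valid posterior vector can make the Bayes decision err with probability above (M−1)/M.

  % The conditional error 1 - max_i P(w_i|x) never exceeds (M-1)/M.
  M = 4;  worst = 0;
  for t = 1:1e5
    post = rand(1, M);  post = post/sum(post);   % a random valid posterior vector
    worst = max(worst, 1 - max(post));           % error of the Bayes decision at this x
  end
  fprintf('largest observed error = %.4f, bound (M-1)/M = %.4f\n', worst, (M-1)/M);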

Problem 2.5 (classification with Gaussians of the same mean)

Since this is a two-class problem we can use the results from the book. We compute the likelihood ratio

l_{12} = p(x|ω_1)/p(x|ω_2) = (σ_2/σ_1) exp{ −(x²/2)( 1/σ_1² − 1/σ_2² ) },   (9)

and compare this to the threshold t defined by

t ≡ (P(ω_2)/P(ω_1)) (λ_{21} − λ_{22})/(λ_{12} − λ_{11}).

Then if l_{12} > t, we classify x as a member of ω_1. In the same way, if l_{12} < t, we classify x as a member of ω_2. The decision point x_0 where we switch classification is given by l_{12}(x_0) = t, or

exp{ −(x_0²/2)( 1/σ_1² − 1/σ_2² ) } = (σ_1/σ_2) t.

Solving for x_0 we get

x_0 = ± ( (2σ_1²σ_2²/(σ_2² − σ_1²)) ln( σ_2 P(ω_1)(λ_{12} − λ_{11}) / (σ_1 P(ω_2)(λ_{21} − λ_{22})) ) )^{1/2}.

For specific values of the parameters in this problem, σ_i², P(ω_i), and λ_{ij}, the two values for x_0 above can be evaluated. These two values of x_0 differ only in their sign and have the same magnitude. For these given values of x_0 we see that if |x| ≤ |x_0| one class is selected as the classification decision, while if |x| > |x_0| the other class is selected. The class selected depends on the relative magnitude of the parameters σ_i², P(ω_i), and λ_{ij}, and seems difficult to determine a priori. To determine the class once we are given a fixed specification of these numbers, we can evaluate l_{12} in Equation 9 for a specific value of x such that |x| ≤ |x_0| (say x = 0) to determine if l_{12} < t or not. If so, the region of x's given by |x| ≤ |x_0| will be classified as members of class ω_2, while the region of x's where |x| > |x_0| will be classified as members of ω_1.

Problem 2.6 (the Neyman-Pearson classification criterion)

Warning: I was not able to solve this problem. Below are a few notes I made while attempting it. If anyone sees a way to proceed please contact me.

For this problem we want to fix the value of the error probability for class one at a particular value, say ε_1 = ε, and then we want to minimize the probability of error we make when classifying the other class. Now recall the definitions of the error probabilities ε_i:

ε_1 = ∫_{R_2} p(x|ω_1)P(ω_1) dx  and  ε_2 = ∫_{R_1} p(x|ω_2)P(ω_2) dx.

We want to find a region R_1 (equivalently a region R_2) such that ε_2 is minimized under the constraint that ε_1 = ε. We can do this using the method of Lagrange multipliers. To use this method, first form the Lagrangian q defined as

q = θ(ε_1 − ε) + ε_2.

To determine the classification decision region boundary x_0 that minimizes this Lagrangian, let R_1 = (−∞, x_0) and R_2 = (x_0, +∞) to get for q

q = θ( P(ω_1) ∫_{x_0}^{+∞} p(x|ω_1) dx − ε ) + P(ω_2) ∫_{−∞}^{x_0} p(x|ω_2) dx.

To minimize q with respect to x_0 requires taking ∂q/∂x_0 and setting this derivative equal to zero. This gives

∂q/∂x_0 = −θP(ω_1)p(x_0|ω_1) + P(ω_2)p(x_0|ω_2) = 0.

Solving for θ in the above expression, we see that it is given by

θ = P(ω_2)p(x_0|ω_2) / (P(ω_1)p(x_0|ω_1)) = P(ω_2|x_0)/P(ω_1|x_0).

Problem 2.9 (deriving an expression for P_B)

If P(ω_1) = P(ω_2) = ½, then the zero-one likelihood ratio given by Equation 8, when λ_{12} = λ_{21} = 1, becomes

p(x|ω_1)/p(x|ω_2) = 1.

Taking logarithms of both sides of this expression gives

ln(p(x|ω_1)) − ln(p(x|ω_2)) = 0.

If we define this scalar difference as u, we see that when both class conditional densities are multidimensional normal with the same covariance Σ (but with different means µ_i), u becomes

u ≡ ln( (1/((2π)^{d/2}|Σ|^{1/2})) exp{ −½(x−µ_1)^T Σ^{-1}(x−µ_1) } ) − ln( (1/((2π)^{d/2}|Σ|^{1/2})) exp{ −½(x−µ_2)^T Σ^{-1}(x−µ_2) } )
 = −½(x−µ_1)^T Σ^{-1}(x−µ_1) + ½(x−µ_2)^T Σ^{-1}(x−µ_2)
 = −½[ x^TΣ^{-1}x − 2µ_1^TΣ^{-1}x + µ_1^TΣ^{-1}µ_1 − x^TΣ^{-1}x + 2µ_2^TΣ^{-1}x − µ_2^TΣ^{-1}µ_2 ]
 = (µ_1 − µ_2)^T Σ^{-1} x − ½µ_1^TΣ^{-1}µ_1 + ½µ_2^TΣ^{-1}µ_2.

Now if x is a sample from class ω_1, then from the class density assumptions and the above expression (it is linear in x), we see that the variable u will be a Gaussian random variable with a mean m_1 given by

m_1 = (µ_1 − µ_2)^T Σ^{-1} µ_1 − ½µ_1^TΣ^{-1}µ_1 + ½µ_2^TΣ^{-1}µ_2
    = ½µ_1^TΣ^{-1}µ_1 − µ_2^TΣ^{-1}µ_1 + ½µ_2^TΣ^{-1}µ_2
    = ½(µ_1 − µ_2)^T Σ^{-1}(µ_1 − µ_2),   (10)

and a variance given by

(µ_1 − µ_2)^T Σ^{-1} Σ Σ^{-1}(µ_1 − µ_2) = (µ_1 − µ_2)^T Σ^{-1}(µ_1 − µ_2).   (11)

In the same way, if x is from class ω_2 then u will be Gaussian with a mean m_2 given by

m_2 = (µ_1 − µ_2)^T Σ^{-1} µ_2 − ½µ_1^TΣ^{-1}µ_1 + ½µ_2^TΣ^{-1}µ_2
    = −½(µ_1 − µ_2)^T Σ^{-1}(µ_1 − µ_2) = −m_1,   (12)

and the same variance as in the case when x is from ω_1, given by Equation 11. Note that in terms of the Mahalanobis distance d_m between the means µ_i, defined as

d_m² ≡ (µ_1 − µ_2)^T Σ^{-1}(µ_1 − µ_2),

we have m_1 = ½d_m² = −m_2. Also note that the variance of p(u|ω_i), given in Equation 11 in terms of the Mahalanobis distance, is d_m², and so the common standard deviation would be d_m.

Thus an expression for the Bayesian error probability, P_B, for this classification problem can be computed based on the one-dimensional random variable u rather than the multidimensional random variable x. Since u is a scalar expression for which we know its probability density, we can compute the probability of error for a classification procedure on x by using u's density. As just computed, the densities p(u|ω_1) and p(u|ω_2) are Gaussian with symmetric means (m_1 = −m_2) and equal variances, so the optimal Bayes classification threshold value (where classification decisions change) is 0. This makes it easy to calculate P_B, the optimal Bayes error estimate, using

P_B = ∫_{−∞}^0 p(u|ω_1)P(ω_1) du + ∫_0^{+∞} p(u|ω_2)P(ω_2) du.

Since P(ω_1) = P(ω_2) = ½, and the integrals of the conditional densities are equal (by symmetry), we can evaluate just one. Specifically, from the discussion above we have

p(u|ω_1) = (1/(√(2π) d_m)) exp{ −½((u − ½d_m²)/d_m)² },

and we find

P_B = (1/(√(2π) d_m)) ∫_{−∞}^0 e^{−(u − ½d_m²)²/(2d_m²)} du.

Let z = (u − ½d_m²)/d_m, so that dz = du/d_m, and P_B becomes

P_B = (1/√(2π)) ∫_{−∞}^{−d_m/2} e^{−z²/2} dz = (1/√(2π)) ∫_{d_m/2}^{+∞} e^{−z²/2} dz,   (13)

the result we were to show. Using the cumulative distribution function for the standard normal, denoted here in the notation of the Matlab function normcdf, defined as

normcdf(x; µ, σ) = (1/(√(2π)σ)) ∫_{−∞}^x e^{−(t−µ)²/(2σ²)} dt,   (14)

we can write the above expression for P_B as

P_B = 1 − normcdf(d_m/2; 0, 1),   (15)

which is easier to implement in Matlab.

Problem 2.10-2.11 (Mahalanobis distances and classification)

Taking the logarithm of the likelihood ratio l_{12} gives a decision rule to state x ∈ ω_1 (and x ∈ ω_2 otherwise) if

ln(p(x|ω_1)) − ln(p(x|ω_2)) > ln(θ).   (16)

If our conditional densities p(x|ω_i) are given by multidimensional normal densities, then they have functional forms given by

p(x|ω_i) = N(x; µ_i, Σ_i) ≡ (1/((2π)^{d/2}|Σ_i|^{1/2})) exp{ −½(x − µ_i)^T Σ_i^{-1} (x − µ_i) }.   (17)

Taking the logarithm of this expression as required by Equation 16, we find

ln(p(x|ω_i)) = −½(x − µ_i)^T Σ_i^{-1} (x − µ_i) − (d/2)ln(2π) − ½ln(|Σ_i|)
            = −½ d_m²(µ_i, x|Σ_i) − (d/2)ln(2π) − ½ln(|Σ_i|),

where we have introduced the Mahalanobis distance d_m in the above. Our decision rule given by Equation 16, in the case when p(x|ω_i) is a multidimensional Gaussian, is thus given by

−½ d_m²(µ_1, x|Σ_1) − ½ln(|Σ_1|) + ½ d_m²(µ_2, x|Σ_2) + ½ln(|Σ_2|) > ln(θ),

or

d_m²(µ_1, x|Σ_1) − d_m²(µ_2, x|Σ_2) + ln( |Σ_1|/|Σ_2| ) < −2ln(θ).   (18)

We will now consider some specializations of these expressions for various possible values of Σ_i. If our decision boundaries are given by Equation 18 but with equal covariances, then we have that ln(|Σ_1|/|Σ_2|) = 0, and for decision purposes the left-hand-side of Equation 18 becomes

d_m²(µ_1, x|Σ) − d_m²(µ_2, x|Σ) = ( x^TΣ^{-1}x − 2x^TΣ^{-1}µ_1 + µ_1^TΣ^{-1}µ_1 ) − ( x^TΣ^{-1}x − 2x^TΣ^{-1}µ_2 + µ_2^TΣ^{-1}µ_2 )
 = −2x^TΣ^{-1}(µ_1 − µ_2) + µ_1^TΣ^{-1}µ_1 − µ_2^TΣ^{-1}µ_2.

Using this in Equation 18 we get

(µ_1 − µ_2)^T Σ^{-1} x > ln(θ) + ½( ‖µ_1‖²_{Σ^{-1}} − ‖µ_2‖²_{Σ^{-1}} ),   (19)

for the decision boundary. I believe the book is missing the square on the Mahalanobis norm of µ_i.

Problem 2.12 (an example classifier design)

Part (a): The optimal two class classifier that minimizes the probability of error is given by classifying a point x depending on the value of the likelihood ratio

l_{12} = p(x|ω_1)/p(x|ω_2).

When the class conditional probability densities p(x|ω_i) are multivariate Gaussian with each feature dimension having the same variance σ_1² = σ_2² = σ², the above likelihood ratio l_{12} becomes

l_{12} = exp{ −(1/(2σ²))( (x − µ_1)^T(x − µ_1) − (x − µ_2)^T(x − µ_2) ) }.

The value of l_{12} is compared to an expression involving the a priori probabilities P(ω_i) and loss expressions λ_{ki}, where λ_{ki} is the loss associated with classifying an object from the true class k as an object from class i. The classification decision is then made according to

x ∈ ω_1 if l_{12} = p(x|ω_1)/p(x|ω_2) > (P(ω_2)/P(ω_1)) (λ_{21} − λ_{22})/(λ_{12} − λ_{11}),   (20)

and x ∈ ω_2 otherwise. To minimize the error probability requires we take λ_{ki} = 1 − δ_{ki}, where δ_{ki} is the Kronecker delta. In this case, and when the a priori probabilities are equal, the constant on the right-hand-side of Equation 20 evaluates to 1. Thus our classification rule is to select class ω_1 if l_{12} > 1 and otherwise select class ω_2. This rule is equivalent to selecting x ∈ ω_1 if p(x|ω_1) > p(x|ω_2). Using the above expression, the decision rule l_{12} > 1 simplifies as follows

(x − µ_1)^T(x − µ_1) − (x − µ_2)^T(x − µ_2) < 0, or
−2(µ_1 − µ_2)^T x < µ_2^Tµ_2 − µ_1^Tµ_1, or
(µ_1 − µ_2)^T x > ½( µ_1^Tµ_1 − µ_2^Tµ_2 ).

This is equivalent to Equation 19 when we take θ = 1 and equal feature covariances.

Part (b): In this case, from the given loss matrix Λ, we see that λ_{11} = 0, λ_{12} = 1, λ_{21} = 0.5, and λ_{22} = 0, and the right-hand-side of Equation 20 becomes ½. The requirement on the likelihood ratio is then l_{12} > ½, which when we take the logarithm of both sides becomes

−(1/(2σ²))[ (x − µ_1)^T(x − µ_1) − (x − µ_2)^T(x − µ_2) ] > ln(½),

which simplifies in exactly the same way as before to

(µ_1 − µ_2)^T x > ½( µ_1^Tµ_1 − µ_2^Tµ_2 ) + σ² ln(½).

Experiments at generating and classifying 10000 random feature vectors from each class using the previous expressions, and then estimating the classification error probability, can be found in the Matlab script chap_2_prob_12.m. For Part (a) we can also use the results of Problem 2.9, namely Equation 15, to exactly compute the error probability P_B and compare it to our empirically computed error probability. When we run that script we get the following results

empirical P_e = ; analytic Bayes P_e =

showing that the empirical results are quite close to the theoretical. For Part (b) we can compute the empirical loss associated with using this classifier in the following way. Let L_1 be the number of samples from the first class that are misclassified as belonging to the second class, L_2 be the number of samples from the second class that are misclassified as belonging to the first class, and N be the total number of samples we classified. Then an empirical estimate of the expected loss r̂ is given by

r̂ = (L_1 + 0.5 L_2)/N.

Problem 2.13 (more classifier design)

Note that for this problem, since the functional form of the class conditional densities p(x|ω_i) has changed, to construct the decision rule in terms of x as we did for Problem 2.12, we would need to again simplify the likelihood ratio expression of Equation 20. If all we care about is the classification of a point x, we can skip these algebraic transformations and simply compute l_{12} = p(x|ω_1)/p(x|ω_2) directly, and then compare this to the simplified right-hand-side of Equation 20, which for Part (a) is 1 and for Part (b) is ½. Again for Part (a) we can exactly compute the Bayes error rate using Equation 15. This procedure is implemented in the Matlab script chap_2_prob_13.m. When we run that script we get

empirical P_e = ; analytic Bayes P_e =

again showing that the empirical results are quite close to the theoretical.
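Since the referenced scripts are not reproduced here, the following minimal MATLAB sketch (mine) runs a Part (a) style experiment; the class means, common variance, and sample count are illustrative assumptions.

  % Monte-Carlo estimate of P_e for two Gaussian classes vs. the analytic value.
  mu1 = [0; 0];  mu2 = [3; 3];  sigma = 1.2;  N = 10000;
  X1 = mu1' + sigma*randn(N, 2);               % samples from class 1
  X2 = mu2' + sigma*randn(N, 2);               % samples from class 2

  % Part (a) rule: decide class 1 when (mu1-mu2)'x > (mu1'mu1 - mu2'mu2)/2
  w = mu1 - mu2;  c = (mu1'*mu1 - mu2'*mu2)/2;
  Pe_emp = 0.5*mean(X1*w <= c) + 0.5*mean(X2*w > c);

  % analytic Bayes error from Equation 15: P_B = 1 - normcdf(d_m/2; 0, 1)
  dm = norm(mu1 - mu2)/sigma;                  % Mahalanobis distance when Sigma = sigma^2 I
  Pe_bayes = 1 - 0.5*(1 + erf((dm/2)/sqrt(2)));  % normcdf written via erf
  fprintf('empirical P_e = %.4f; analytic Bayes P_e = %.4f\n', Pe_emp, Pe_bayes);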

Problem 2.14 (orthogonality of the decision hyperplane)

We want to show that the decision hyperplane at the point x_0 is tangent to the constant Mahalanobis distance hyperellipsoid d_m(x, µ_i) = c_i. If we can show that the vector v defined by

v ≡ ∂d_m(x, µ_i)/∂x |_{x=x_0}

is orthogonal to all vectors in the decision hyperplane, then since v is normal to the surfaces d_m(x, µ_i) = c_i, we will have a tangent decision hyperplane. The Mahalanobis distance between µ_i and a point x is given by

d_m(µ_i, x) = ( (x − µ_i)^T Σ^{-1} (x − µ_i) )^{1/2}.

The gradient of d_m(µ_i, x), considered as a function of x, is given by

∂d_m(µ_i, x)/∂x = ½( (x − µ_i)^T Σ^{-1} (x − µ_i) )^{-1/2} ∂/∂x( (x − µ_i)^T Σ^{-1} (x − µ_i) )
 = ½ d_m^{-1} ( 2Σ^{-1}(x − µ_i) ) = (1/d_m(µ_i, x)) Σ^{-1}(x − µ_i).

Consider this expression evaluated at x = x_0; we would have to evaluate x_0 − µ_i. From the expression for x_0 this is

x_0 − µ_i = ½(µ_i + µ_j) − µ_i − ln( P(ω_i)/P(ω_j) ) (µ_i − µ_j)/‖µ_i − µ_j‖²_{Σ^{-1}},

which is a vector proportional to µ_i − µ_j. Thus we see that for a point x on the decision hyperplane we have that

v^T(x − x_0) = (1/d_m(µ_i, x_0)) ( Σ^{-1}(x_0 − µ_i) )^T (x − x_0) ∝ ( Σ^{-1}(µ_i − µ_j) )^T (x − x_0) = 0,

since for the case of multidimensional Gaussians with equal non-diagonal covariance matrices Σ, the points x on the decision hyperplane are given by

g_ij(x) = ( Σ^{-1}(µ_i − µ_j) )^T (x − x_0) = 0.

The decision hyperplane is also tangent to the surfaces d_m(x, µ_j) = c_j, since to show that we would then need to evaluate (1/d_m(µ_j, x)) Σ^{-1}(x − µ_j) at x = x_0, and we would again find that x_0 − µ_j is proportional to µ_i − µ_j.

Problem 2.15 (bounding the probability of error)

We are told to assume that

p(x|ω_1) ~ N(µ, σ²)
p(x|ω_2) ~ U(a, b) = { 1/(b − a) for a < x < b; 0 otherwise. }

The minimum probability of error criterion is to classify a sample with feature x as a member of the class ω_1 if

p(x|ω_1)/p(x|ω_2) > P(ω_2)/P(ω_1),

and classify x as from the class ω_2 otherwise. Since p(x|ω_2) is zero for some values of x, we should write this expression as

p(x|ω_1) > (P(ω_2)/P(ω_1)) p(x|ω_2).

Note that from the above, if x ∉ [a, b], since p(x|ω_2) = 0 the above expression will be true and we would classify that point as from class ω_1. It remains to determine how we would classify any points x ∈ [a, b]. Since p(x|ω_2) is a uniform distribution, the above inequality is given by

p(x|ω_1) > (P(ω_2)/P(ω_1)) (1/(b − a)).   (21)

Given that the density p(x|ω_1) is a Gaussian, we can find any values x_0 such that the above inequality is an equality. That is, x_0 must satisfy

(1/√(2πσ²)) e^{−(x_0 − µ)²/(2σ²)} = P(ω_2)/(P(ω_1)(b − a)).

When we solve for x_0 we find

x_0 = µ ± σ √( 2 ln( (b − a)P(ω_1)/(√(2π)σ P(ω_2)) ) ).   (22)

There are at most two real solutions for x_0 in the above expression, which we denote by x_0^− and x_0^+, where x_0^− < x_0^+. Depending on the different possible values of x_0^− and x_0^+ we will evaluate the error probability P_e. If neither x_0^− nor x_0^+ is real, then the classification decision regions in this case become

R_1 = (−∞, a) ∪ (b, +∞)  and  R_2 = (a, b).

Then P_e is given by

P_e = P(ω_2) ∫_{R_1} p(x|ω_2) dx + P(ω_1) ∫_{R_2} p(x|ω_1) dx = P(ω_1) ∫_a^b p(x|ω_1) dx < ∫_a^b p(x|ω_1) dx,   (23)

since ∫_{R_1} p(x|ω_2) dx = 0 for the R_1 region.

Next consider the case where x_0^± are both real. From the expression for x_0 given by Equation 22 this means that

ln( (b − a)P(ω_1)/(√(2π)σ P(ω_2)) ) > 0,

or

1/√(2πσ²) > P(ω_2)/(P(ω_1)(b − a)).

What we notice about this last expression is that it is exactly Equation 21 evaluated at x = µ. Since µ ∈ (x_0^−, x_0^+), this means that all points in the range (x_0^−, x_0^+) are classified as belonging to the first class. Thus we have shown that in the case when x_0^± are both real we have

R_1 = (−∞, a) ∪ (b, +∞) ∪ (x_0^−, x_0^+)
R_2 = (a, b) \ (x_0^−, x_0^+).

With these regions, P_e is given by

P_e = P(ω_1) ∫_{R_2} p(x|ω_1) dx + P(ω_2) ∫_{R_1} p(x|ω_2) dx
    = P(ω_1) ∫_{(a,b)\(x_0^−,x_0^+)} p(x|ω_1) dx + P(ω_2) ∫_{(a,b)∩(x_0^−,x_0^+)} p(x|ω_2) dx.

Now for any x between x_0^− and x_0^+ we have argued using Equation 21 that P(ω_1)p(x|ω_1) > P(ω_2)p(x|ω_2), and thus we can bound the second term above as

P(ω_2) ∫_{(a,b)∩(x_0^−,x_0^+)} p(x|ω_2) dx ≤ P(ω_1) ∫_{(a,b)∩(x_0^−,x_0^+)} p(x|ω_1) dx.

Thus the above expression for P_e is bounded as

P_e ≤ P(ω_1) ∫_{(a,b)\(x_0^−,x_0^+)} p(x|ω_1) dx + P(ω_1) ∫_{(a,b)∩(x_0^−,x_0^+)} p(x|ω_1) dx
   = P(ω_1) ∫_a^b p(x|ω_1) dx < ∫_a^b p(x|ω_1) dx.

This last expression is exactly like the bound presented for P_e in Equation 23. The next step is to simplify the expression ∫_a^b p(x|ω_1) dx. Since p(x|ω_1) is N(µ, σ²), to evaluate this integral we make the change of variable from x to z defined as z = (x − µ)/σ to get

∫_a^b p(x|ω_1) dx = ∫_{(a−µ)/σ}^{(b−µ)/σ} (1/√(2π)) e^{−z²/2} dz
 = ∫_{−∞}^{(b−µ)/σ} (1/√(2π)) e^{−z²/2} dz − ∫_{−∞}^{(a−µ)/σ} (1/√(2π)) e^{−z²/2} dz
 = G( (b−µ)/σ ) − G( (a−µ)/σ ),

the expression we were to show.
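A small MATLAB check (mine; the parameter values are illustrative assumptions) confirms the bound numerically by comparing a brute-force Bayes error against G((b−µ)/σ) − G((a−µ)/σ) for equal priors.

  % Numeric check of the Problem 2.15 bound with assumed parameters.
  mu = 0;  sigma = 1;  a = -1;  b = 2;
  G = @(z) 0.5*(1 + erf(z/sqrt(2)));          % standard normal CDF
  bound = G((b - mu)/sigma) - G((a - mu)/sigma);

  x = linspace(mu - 8*sigma, b + 8*sigma, 200001);  dx = x(2) - x(1);
  p1 = exp(-(x - mu).^2/(2*sigma^2))/(sqrt(2*pi)*sigma);
  p2 = (x > a & x < b)/(b - a);
  Pe = sum(min(0.5*p1, 0.5*p2))*dx;           % Bayes error = integral of the smaller scaled density
  fprintf('P_e = %.4f <= bound = %.4f\n', Pe, bound);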

Problem 2.16 (the mean of the expression ∂ln(p(x;θ))/∂θ)

We compute this directly

E[ ∂ln(p(x;θ))/∂θ ] = E[ (1/p(x;θ)) ∂p(x;θ)/∂θ ]
 = ∫ (1/p(x;θ)) ( ∂p(x;θ)/∂θ ) p(x;θ) dx = ∫ ∂p(x;θ)/∂θ dx
 = ∂/∂θ ∫ p(x;θ) dx = ∂/∂θ (1) = 0,

as claimed.

Problem 2.17 (the probability of flipping heads)

We have a likelihood (probability) for the N flips given by

P(X;q) = ∏_{i=1}^N q^{x_i}(1 − q)^{1−x_i}.

The log-likelihood of this expression is then

L(q) ≡ ln(P(X;q)) = Σ_{i=1}^N ( x_i ln(q) + (1 − x_i) ln(1 − q) ).

To find the ML estimate for q, we will maximize ln(P(X;q)) as a function of q. To do this we compute

dL/dq = Σ_{i=1}^N ( x_i/q − (1 − x_i)/(1 − q) ) = 0.

When we solve for q in the above we find

q = (1/N) Σ_{i=1}^N x_i.

Problem 2.18 (the Cramer-Rao bound)

When we consider the ML estimate of the mean µ of a Gaussian multidimensional random variable with known covariance matrix Σ, we are led to consider the log-likelihood L(µ) given by

L(µ) = −(N/2) ln( (2π)^l |Σ| ) − ½ Σ_{k=1}^N (x_k − µ)^T Σ^{-1} (x_k − µ),   (24)

With l = 1 and Σ = σ² we have the log-likelihood L(µ) given by

L(µ) = −(N/2) ln(2πσ²) − (1/(2σ²)) Σ_{k=1}^N (x_k − µ)².

To discuss the Cramer-Rao lower bound, we begin with the assumption that we have a density for our samples x that is parametrized via the elements of a vector θ, such that x ~ p(x;θ). We then consider any unbiased estimator of these density parameters θ, denoted as θ̂. If the estimator is unbiased, the Cramer-Rao lower bound then gives us a lower bound on the variance of this estimator. The specific lower bound on the variance of θ̂ is given by forming the log-likelihood

L(θ) ≡ ln( ∏_{i=1}^N p(x_i;θ) ),

and then taking the expectation over partial derivatives of this expression. Namely, construct the Fisher matrix, which has (i,j)th element given by

J_{ij} ≡ −E[ ∂²L(θ)/∂θ_i∂θ_j ].

Then the variance of θ̂_i must be larger than the value of (J^{-1})_{ii}. As an equation, this statement is that

E[ (θ̂_i − θ_i)² ] ≥ (J^{-1})_{ii}.

If we happen to find an estimator θ̂_i where the above inequality is satisfied as an equality, then our estimator is said to be efficient. From the above discussion we see that we need to evaluate derivatives of L with respect to the parameters of the density. We find the first two derivatives of this expression with respect to µ are given by

∂L/∂µ = (1/σ²) Σ_{k=1}^N (x_k − µ)
∂²L/∂µ² = −N/σ².

Thus we get

−E[ ∂²L/∂µ² ] = N/σ², so that 1/( −E[ ∂²L/∂µ² ] ) = σ²/N = var(µ̂_ML).

Since we have shown that the bound is met with equality, the ML estimator for the mean µ̂_ML is efficient. Now consider the case where the unknown parameter of our density is the variance σ². Then we have

∂L/∂σ² = −N/(2σ²) + Σ_{k=1}^N (x_k − µ)²/(2(σ²)²)
∂²L/∂(σ²)² = N/(2(σ²)²) − Σ_{k=1}^N (x_k − µ)²/((σ²)³).

Recall that E[(x_k − µ)²] = σ² since x_k ~ N(µ, σ²), and we can evaluate the expectation of the second derivative of L with respect to σ² as

E[ ∂²L/∂(σ²)² ] = N/(2(σ²)²) − Σ_{k=1}^N σ²/((σ²)³) = N/(2(σ²)²) − N/(σ²)² = −N/(2(σ²)²).

Thus

1/( −E[ ∂²L/∂(σ²)² ] ) = 2(σ²)²/N.

To consider whether the ML estimator of the variance is efficient (i.e. satisfies the Cramer-Rao lower bound), we need to determine if the above expression is equal to var(σ̂²_ML). From Problem 2.19 below, the ML estimator of σ² in the scalar Gaussian case is given by

σ̂²_ML = (1/N) Σ_{i=1}^N (x_i − µ̂_ML)²,

where µ̂_ML is the ML estimate of the mean, (1/N) Σ_{k=1}^N x_k. Then it can be shown that σ̂²_ML has a chi-squared distribution with N − 1 degrees of freedom, in the sense that

(N/σ²) σ̂²_ML ~ χ²_{N−1}.

Thus, since the expectation and variance of a chi-squared distribution with N − 1 degrees of freedom are N − 1 and 2(N − 1) respectively, we have that

E[σ̂²_ML] = (σ²/N)(N − 1) = ((N − 1)/N) σ²
var(σ̂²_ML) = (σ⁴/N²) 2(N − 1) = 2(N − 1)σ⁴/N².

From the given expression for var(σ̂²_ML) we see that

var(σ̂²_ML) ≠ 1/( −E[ ∂²L/∂(σ²)² ] ),

and thus the ML estimate of σ² is not efficient.

Problem 2.19 (ML with the multidimensional Gaussian)

We recall that the probability density function for the multidimensional Gaussian is given by

p(x|µ, Σ) = (1/((2π)^{l/2}|Σ|^{1/2})) exp{ −½(x − µ)^T Σ^{-1} (x − µ) }.

So the log-likelihood L when given N samples from the above distribution is given by

L = ln( ∏_{k=1}^N p(x_k|µ, Σ) ) = Σ_{k=1}^N ln(p(x_k|µ, Σ)).

The logarithm of the probability density function for a multidimensional Gaussian is given by

ln(p(x_k|µ, Σ)) = −½(x_k − µ)^T Σ^{-1} (x_k − µ) − ½ln( (2π)^d |Σ| ),

so that the above becomes

L = −½ Σ_{k=1}^N (x_k − µ)^T Σ^{-1} (x_k − µ) − (N/2) ln( (2π)^d |Σ| ).

To evaluate the maximum likelihood estimates for µ and Σ, we would notionally take the derivative with respect to these two parameters, set the resulting expressions equal to zero, and solve for the parameters. To do this requires some vector derivatives. Remembering that

∂/∂x ( x^T M x ) = (M + M^T)x,

we see that the derivative of L with respect to µ is given by

∂L/∂µ = Σ_{k=1}^N Σ^{-1}(x_k − µ) = 0,

since Σ^{-1} is a symmetric matrix. On multiplying by the covariance matrix Σ on both sides of the above we have

Σ_{k=1}^N (x_k − µ) = 0  or  µ = (1/N) Σ_{k=1}^N x_k.

So the maximum likelihood estimate of µ is just the sample mean, as claimed. To evaluate the maximum likelihood estimate for Σ, we begin by instead computing the maximum likelihood estimate for Σ^{-1}. In terms of M ≡ Σ^{-1} we have L given by

L(M) = −½ Σ_{k=1}^N (x_k − µ)^T M (x_k − µ) − (N/2) ln( (2π)^d ) + (N/2) ln(|M|).

To evaluate the derivative of the above, it is helpful to recall two identities regarding matrix derivatives. The first involves the logarithm of the determinant and is

∂ln(|M|)/∂M = (M^{-1})^T = (M^T)^{-1} = M^{-1},

since M and M^{-1} are both symmetric. The second involves the matrix derivative of the scalar form a^T M b. We recall that

∂/∂M ( a^T M b ) = ab^T.

Using these two identities we have that the M derivative of L is given by

∂L/∂M = −½ Σ_{k=1}^N (x_k − µ)(x_k − µ)^T + (N/2) M^{-1}.

When we set this equal to zero and then solve for M^{-1} we get

M^{-1} = (1/N) Σ_{k=1}^N (x_k − µ)(x_k − µ)^T.

Since M^{-1} = Σ, and we evaluate µ at the maximum likelihood solution given above, the previous expression is what we were to show.

Problem 2.20 (the ML estimate in an Erlang distribution)

The likelihood function for θ given the N measurements x_i from the Erlang distribution is

P(X;θ) = ∏_{i=1}^N θ² x_i e^{−θx_i} u(x_i).

Since u(x_i) = 1 for all i, this factor is not explicitly needed. From this expression the log-likelihood of X is given by

L(θ) ≡ ln(P(X;θ)) = Σ_{i=1}^N ( 2ln(θ) + ln(x_i) − θx_i )
    = 2N ln(θ) + Σ_{i=1}^N ln(x_i) − θ Σ_{i=1}^N x_i.

To find the maximum of this expression, we take the derivative with respect to θ, set the expression to zero, and solve for θ. We find dL/dθ = 0 means

2N/θ − Σ_{i=1}^N x_i = 0,

which has a solution for θ given by

θ = 2N / Σ_{i=1}^N x_i,

as we were to show.

Problem 2.21 (the ML estimate occurs at a maximum)

When we consider the ML estimate of the mean µ of a Gaussian multidimensional random variable with known covariance matrix Σ, we are led to consider the log-likelihood L(µ) given

by Equation 24. We have shown that the first derivative of L(µ) with respect to µ is given by

∂L(µ)/∂µ = Σ_{k=1}^N Σ^{-1}(x_k − µ) = Σ^{-1} Σ_{k=1}^N x_k − NΣ^{-1}µ.

The second derivative of L(µ) with respect to µ is then

∂²L(µ)/∂µ² = −NΣ^{-1}.

Since the matrix Σ is positive definite, we have that Σ^{-1} is positive definite, and so −NΣ^{-1} is negative definite. That the matrix second derivative is negative definite is the condition for the solution to ∂L(µ)/∂µ = 0 to be a maximum of the objective function L(µ).

Problem 2.22 (p(x|X) is normal with a given mean and covariance)

We assume that once we know the mean µ, the sample x is drawn from p(x|µ) ~ N(µ, σ²), and that the true mean µ is random itself, given by a random draw from another normal distribution, p(µ) ~ N(µ_0, σ_0²). Then using Bayes' rule we can compute what the a posteriori distribution of µ is after having observed the data set X, as

p(µ|X) = p(X|µ)p(µ)/p(X).

As a function of µ, the expression p(X) is a constant. Assuming independence of the data samples x_i in X we conclude

p(X|µ) = ∏_{k=1}^N p(x_k|µ).

Thus combining all of these expressions we see that

p(µ|X) = α p(µ) ∏_{k=1}^N p(x_k|µ)
 = α (1/√(2πσ_0²)) exp{ −(µ − µ_0)²/(2σ_0²) } ∏_{k=1}^N (1/√(2πσ²)) exp{ −(x_k − µ)²/(2σ²) }
 = α' exp{ −½( Σ_{k=1}^N (µ² − 2x_kµ + x_k²)/σ² + (µ² − 2µµ_0 + µ_0²)/σ_0² ) }
 = α'' exp{ −½( (N/σ² + 1/σ_0²)µ² − 2( (1/σ²)Σ_{k=1}^N x_k + µ_0/σ_0² )µ ) }.   (25)

Note that we have written p(µ) on the outside of the product terms since it should only appear once, and not N times as might be inferred had we written the product as

∏_{k=1}^N p(µ)p(x_k|µ). From Equation 25 we see that the density p(µ|X) is Gaussian, due to the quadratic expression for µ in the argument of the exponential. To find the values of the mean and variance of this Gaussian, define σ_N² to be such that

1/σ_N² = N/σ² + 1/σ_0²  ⇒  σ_N² = σ²σ_0²/(Nσ_0² + σ²).   (26)

Now define x̄ ≡ (1/N) Σ_{k=1}^N x_k and take µ_N as the value that makes the ratio µ_N/σ_N² equal to the coefficient of µ in the argument of the exponential above, or

µ_N/σ_N² = N x̄/σ² + µ_0/σ_0²,

or solving for µ_N we get

µ_N = ( σ²σ_0²/(Nσ_0² + σ²) )( N x̄/σ² + µ_0/σ_0² ) = ( Nσ_0²/(Nσ_0² + σ²) ) x̄ + ( σ²/(Nσ_0² + σ²) ) µ_0.

Once we have defined these variables we see that p(µ|X) is given by

p(µ|X) = α exp{ −(1/(2σ_N²))( µ² − 2µ_Nµ ) }
       = α' exp{ −(1/(2σ_N²))( µ² − 2µ_Nµ + µ_N² ) }
       = α' exp{ −(1/(2σ_N²))( µ − µ_N )² },

showing that p(µ|X) is a Gaussian with mean µ_N and variance σ_N². Next we are asked to compute p(x|X), where using the above expression for p(µ|X) we find

p(x|X) = ∫ p(x, µ|X) dµ = ∫ p(x|µ, X) p(µ|X) dµ = ∫ p(x|µ) p(µ|X) dµ
 = ∫ (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)} (1/√(2πσ_N²)) e^{−(µ−µ_N)²/(2σ_N²)} dµ
 = (1/(2πσσ_N)) ∫ exp{ −½[ (x² − 2xµ + µ²)/σ² + (µ² − 2µµ_N + µ_N²)/σ_N² ] } dµ.

Grouping terms in the argument of the exponent gives the expression

−½[ (1/σ_N² + 1/σ²)µ² − 2( x/σ² + µ_N/σ_N² )µ + ( x²/σ² + µ_N²/σ_N² ) ].

In prob_2_22_integration.nb we integrate the above to get

p(x|X) = (1/(√(2π)√(σ² + σ_N²))) e^{−(x − µ_N)²/(2(σ² + σ_N²))},

or a Gaussian distribution with mean µ_N and variance σ² + σ_N².

Problem 2.23 (the MAP estimate of µ for a Rayleigh distribution)

The likelihood of observing all N measurements x_i from a normal N(µ, σ²), where µ is from a Rayleigh distribution probability density function, is given by

l(µ) = ( ∏_{i=1}^N p(x_i|µ, σ²) ) p(µ),

where p(µ) is the Rayleigh density p(µ) = (µ/σ_µ²) exp( −µ²/(2σ_µ²) ). Then the log-likelihood becomes

L(µ) = ln(l(µ)) = ln(p(µ)) + Σ_{i=1}^N ln(p(x_i|µ, σ²))
 = ln(µ) − µ²/(2σ_µ²) − ln(σ_µ²) + Σ_{i=1}^N ( ln( 1/(√(2π)σ) ) − (x_i − µ)²/(2σ²) )
 = ln(µ) − µ²/(2σ_µ²) − ln(σ_µ²) + N ln( 1/(√(2π)σ) ) − (1/(2σ²)) Σ_{i=1}^N (x_i − µ)².

Now to maximize L(µ), take the derivative with respect to µ, set the result equal to zero, and solve for µ. We find

dL/dµ = 1/µ − µ/σ_µ² + (1/σ²) Σ_{i=1}^N (x_i − µ) = 0,

or

( 1/σ_µ² + N/σ² )µ² − ( (1/σ²) Σ_{i=1}^N x_i )µ − 1 = 0.

Defining the coefficient of µ² to be R and the magnitude of the coefficient of µ to be Z, we have, using the quadratic equation, that the MAP solution for µ is given by

µ = ( Z ± √(Z² + 4R) )/(2R) = ( Z/(2R) )( 1 ± √(1 + 4R/Z²) ),

the desired result, where we must take the positive sign in the above expression since µ > 0.

Problem 2.24 (ML with the lognormal distribution)

For the lognormal distribution the density is given by

p(x) = (1/(σx√(2π))) exp( −(ln(x) − θ)²/(2σ²) )  for x > 0,   (27)

so the log-likelihood of observing N specific samples x_i is given by

L(θ) = Σ_{i=1}^N ln(p(x_i)) = Σ_{i=1}^N ( −ln( σx_i√(2π) ) − (ln(x_i) − θ)²/(2σ²) ).

Then dL/dθ = 0 is given by

dL/dθ = Σ_{i=1}^N (ln(x_i) − θ)/σ² = 0.

Solving for θ we get

θ = (1/N) Σ_{i=1}^N ln(x_i).

Problem 2.25 (maximum entropy estimation of p(x) with known mean and variance)

We want to maximize −∫ p(x) ln(p(x)) dx subject to the following constraints

∫ p(x) dx = 1
∫ x p(x) dx = µ
∫ (x − µ)² p(x) dx = σ².

This is the same as minimizing the negative of the above entropy expression. Adding the constraints to form the Lagrangian, our constrained minimization problem becomes

H_L = ∫ p(x) ln(p(x)) dx − λ_1( ∫ p(x) dx − 1 ) − λ_2( ∫ x p(x) dx − µ ) − λ_3( ∫ (x − µ)² p(x) dx − σ² ).

Taking the p derivative of this expression and setting the result equal to zero gives

∂H_L/∂p = ∫ ln(p(x)) dx + ∫ dx − λ_1 ∫ dx − λ_2 ∫ x dx − λ_3 ∫ (x − µ)² dx = 0.

We next solve for the expression ∫ ln(p(x)) dx, and then take the derivative of the resulting expression, to get

ln(p(x)) = −( 1 − λ_1 − λ_2 x − λ_3 (x − µ)² ).

Thus p(x) is given by

p(x) = e^{−(1 − λ_1 − λ_2 x − λ_3 (x − µ)²)}.

We now need to evaluate the Lagrange multipliers λ_1, λ_2, and λ_3. To do this we use the three known constraints, which we write in terms of the known functional form for our density p(x) as

∫ e^{−(1 − λ_1 − λ_2 x − λ_3 (x − µ)²)} dx = 1
∫ x e^{−(1 − λ_1 − λ_2 x − λ_3 (x − µ)²)} dx = µ
∫ (x − µ)² e^{−(1 − λ_1 − λ_2 x − λ_3 (x − µ)²)} dx = σ².

We next perform the integrations on the left-hand-side of the above expressions in the Mathematica file prob_2_25.nb. We then solve the resulting three equations for λ_1, λ_2, and λ_3. When we do that we find

λ_1 = ¼ ln( e⁴/(4π²σ⁴) ) = 1 − ½ ln(2πσ²)
λ_2 = 0
λ_3 = −1/(2σ²).

Thus with these values for λ_i the form of p(x) is

p(x) = e^{−(1 − λ_1 − λ_3(x − µ)²)} = e^{−½ln(2πσ²)} e^{−(x − µ)²/(2σ²)} = (1/√(2πσ²)) e^{−(x − µ)²/(2σ²)},

or a Gaussian distribution, N(µ, σ²), as claimed.

Problem 2.26 (the derivation of the EM algorithm)

Continuing the discussion in the book on the EM algorithm, we will present the derivation of the algorithm in the case where the density of our samples x_k is given by a multidimensional Gaussian. That is, we assume that p(x_k|j; θ) = N(x_k|µ_j, C_j). Then following the book, for the E-step of the EM algorithm we take the expectation over the unobserved sample cluster labels, which we denote as P(j_k|x_k; Θ(t)), as

Q(Θ; Θ(t)) = Σ_{k=1}^N Σ_{j_k=1}^J P(j_k|x_k; Θ(t)) ln( p(x_k|j_k, θ) P_{j_k} )
          = Σ_{k=1}^N Σ_{j=1}^J P(j|x_k; Θ(t)) ln( p(x_k|j, θ) P_j ).   (28)

From the assumed form for p(x_k|j; θ), its natural logarithm is given by

ln(p(x_k|j; θ)) = −½ ln( (2π)^l |C_j| ) − ½(x_k − µ_j)^T C_j^{-1} (x_k − µ_j).

With this expression, and dropping constants that won't change the value of the subsequent maximization, for Q(Θ; Θ(t)) we get the following expression

Σ_{k=1}^N Σ_{j=1}^J P(j|x_k; Θ(t)) [ −½ln(|C_j|) − ½(x_k − µ_j)^T C_j^{-1}(x_k − µ_j) + ln(P_j) ].

For the M-step we maximize the above expression over the parameters µ_j, C_j and P_j, subject to the constraint that Σ_j P_j = 1. The classical way to solve this maximization is using the method of Lagrange multipliers. In that method we would extend Q(Θ; Θ(t)), creating a new objective function Q'(Θ; Θ(t)), to include a Lagrange multiplier (denoted by λ) to enforce the constraint that Σ_j P_j = 1, as

Q'(Θ; Θ(t)) = Σ_{k=1}^N Σ_{j=1}^J [ P(j|x_k; Θ(t)) ln(N(x_k|µ_j, C_j)) + P(j|x_k; Θ(t)) ln(P_j) ] − λ( Σ_{j=1}^J P_j − 1 ).   (29)

We then proceed to maximize this expression by taking derivatives with respect to the variables µ_j, C_j, and P_j, setting the resulting expressions equal to zero, and solving the resulting equations for them. We begin by taking ∂/∂µ_j of Q'(Θ; Θ(t)). We find

∂Q'(Θ; Θ(t))/∂µ_j = Σ_{k=1}^N P(j|x_k; Θ(t)) (1/N(x_k|µ_j, C_j)) ( ∂N(x_k|µ_j, C_j)/∂µ_j ).

The derivative required in the above is given by

∂N(x_k|µ_j, C_j)/∂µ_j = N(x_k|µ_j, C_j) ∂/∂µ_j( −½(x_k − µ_j)^T C_j^{-1}(x_k − µ_j) )
 = −½ N(x_k|µ_j, C_j) ( C_j^{-1} + C_j^{-T} )( −(x_k − µ_j) )
 = N(x_k|µ_j, C_j) C_j^{-1}(x_k − µ_j).   (30)

Thus

∂Q'(Θ; Θ(t))/∂µ_j = Σ_{k=1}^N P(j|x_k; Θ(t)) C_j^{-1}(x_k − µ_j).   (31)

Setting this expression equal to zero and solving for µ_j we have

µ_j = Σ_{k=1}^N P(j|x_k; Θ(t)) x_k / Σ_{k=1}^N P(j|x_k; Θ(t)).   (32)

Next we take the derivative of Q'(Θ; Θ(t)) with respect to C_j, which we will evaluate using the chain rule, transforming the derivative with respect to C_j into one with respect to C_j^{-1}. We have

∂Q'(Θ; Θ(t))/∂C_j = ∂Q'(Θ; Θ(t))/∂C_j^{-1} · ∂C_j^{-1}/∂C_j.

Thus if ∂Q'(Θ; Θ(t))/∂C_j^{-1} = 0, we have that ∂Q'(Θ; Θ(t))/∂C_j = 0 also. From this we can look for zeros of the derivative by looking for values of C_j where the derivative of Q' with respect to the inverse of C_j vanishes. Taking the derivative of Q'(Θ; Θ(t)) with respect to C_j^{-1} we find

∂Q'(Θ; Θ(t))/∂C_j^{-1} = Σ_{k=1}^N P(j|x_k; Θ(t)) ∂/∂C_j^{-1} ln(N(x_k|µ_j, C_j))
 = Σ_{k=1}^N P(j|x_k; Θ(t)) (1/N(x_k|µ_j, C_j)) ( ∂/∂C_j^{-1} N(x_k|µ_j, C_j) ).

From this we see that as a subproblem we need to compute ∂N(x_k|µ_j, C_j)/∂C_j^{-1}, which we now do:

∂/∂C_j^{-1} N(x_k|µ_j, C_j) = ∂/∂C_j^{-1}( (1/(2π)^{l/2}) |C_j|^{-1/2} exp{ −½(x_k − µ_j)^T C_j^{-1}(x_k − µ_j) } )
 = (1/(2π)^{l/2}) ( ∂|C_j|^{-1/2}/∂C_j^{-1} ) exp{ −½(x_k − µ_j)^T C_j^{-1}(x_k − µ_j) }
 + (1/(2π)^{l/2}) |C_j|^{-1/2} ∂/∂C_j^{-1} exp{ −½(x_k − µ_j)^T C_j^{-1}(x_k − µ_j) },

using the product rule. To evaluate the first derivative in the above we note that

∂|C_j|^{-1/2}/∂C_j^{-1} = ∂|C_j^{-1}|^{1/2}/∂C_j^{-1} = ½|C_j^{-1}|^{-1/2} ∂|C_j^{-1}|/∂C_j^{-1},

but using the following matrix derivative of a determinant identity

∂|AXB|/∂X = |AXB| (X^{-1})^T = |AXB| (X^T)^{-1},   (33)

with A = B = I we have ∂|X|/∂X = |X|(X^{-1})^T, and the derivative ∂|C_j|^{-1/2}/∂C_j^{-1} becomes

∂|C_j^{-1}|^{1/2}/∂C_j^{-1} = ½|C_j^{-1}|^{1/2} C_j^T = ½|C_j|^{-1/2} C_j.

Next, using the fact that the matrix derivative of an inner product is given by

∂(a^T X b)/∂X = ab^T,   (34)

we have the derivative of the inner product expression

∂/∂C_j^{-1}( −½(x_k − µ_j)^T C_j^{-1}(x_k − µ_j) ) = −½(x_k − µ_j)(x_k − µ_j)^T.

Putting everything together we find that

∂/∂C_j^{-1} N(x_k|µ_j, C_j) = ½ (1/(2π)^{l/2}) |C_j|^{-1/2} C_j exp{ −½(x_k − µ_j)^T C_j^{-1}(x_k − µ_j) }
 − ½ N(x_k|µ_j, C_j) (x_k − µ_j)(x_k − µ_j)^T
 = ½ N(x_k|µ_j, C_j) ( C_j − (x_k − µ_j)(x_k − µ_j)^T ).   (35)

So combining these subproblems we finally find

∂ln(N(x_k|µ_j, C_j))/∂C_j^{-1} = ½( C_j − (x_k − µ_j)(x_k − µ_j)^T ).   (36)

Using this in the expression for ∂Q'(Θ; Θ(t))/∂C_j^{-1} = 0, we find the equation

Σ_{k=1}^N P(j|x_k; Θ(t)) C_j − Σ_{k=1}^N P(j|x_k; Θ(t)) (x_k − µ_j)(x_k − µ_j)^T = 0,

which when we solve for C_j gives

C_j = Σ_{k=1}^N P(j|x_k; Θ(t)) (x_k − µ_j)(x_k − µ_j)^T / Σ_{k=1}^N P(j|x_k; Θ(t)).   (37)

Warning: The above has µ_j meaning the old value of µ_j, rather than µ_j(t+1), the newly computed value via Equation 32. I'm a bit unclear here as to whether or not this matters, is a typo, or something else. If anyone has any information on this please contact me. Chapter 14 of this book also discusses the expectation maximization algorithm and has an equivalent formulation to the one above. In situations like this, if we replace µ_j with µ_j(t+1) we get faster convergence.

To complete a full maximization of Q'(Θ; Θ(t)) we still need to determine P_j, the a priori probabilities of the j-th cluster. Setting ∂Q'(Θ; Θ(t))/∂P_j = 0 gives

Σ_{k=1}^N P(j|x_k; Θ(t))/P_j − λ = 0,

or

λP_j = Σ_{k=1}^N P(j|x_k; Θ(t)).

Summing this equation over j for j = 1 to J, and using Σ_{j=1}^J P_j = 1, we have

λ = Σ_{j=1}^J Σ_{k=1}^N P(j|x_k; Θ(t)) = Σ_{k=1}^N Σ_{j=1}^J P(j|x_k; Θ(t)) = Σ_{k=1}^N 1 = N.

Here we have used the fact that P(j|x_k; Θ(t)) is the probability that the sample x_k is from cluster j; since there are 1, 2, ..., J clusters, summing this probability over j gives one. Thus

P_j = (1/N) Σ_{k=1}^N P(j|x_k; Θ(t)).   (38)

Combining this expression with Equations 32 and 37 gives the EM algorithm.

Problem 2.27 (consistency of the counting density)

We are told that k is distributed as a binomial random variable with parameters (P, N). This means that the probability we observe the value of k samples in our interval of length h after N trials is given by

p(k) = (N choose k) P^k (1 − P)^{N−k}  for 0 ≤ k ≤ N.

We desire to estimate the probability of success, P, from the measurement k via the ratio k/N. Let's compute the expectation and variance of this estimate of P. The expectation is given by

E[ k/N | P ] = (1/N) E[k|P].

Now since k is drawn from a binomial random variable with parameters (N, P), the expectation of k is PN, from which we see that the above equals P (showing that our estimator of P is unbiased). To study the conditional variance of our error (defined as e = P̂ − P), consider

σ_e²(P) = E[ (e − E[e])² | P ] = E[ ( (k/N) − P )² | P ]
 = (1/N²) E[ (k − NP)² | P ] = (1/N²)( NP(1 − P) ) = P(1 − P)/N.   (39)

In the above we have used the result that the variance of a binomial random variable with parameters (N, P) is NP(1 − P). Thus we have shown that the estimator k/N is asymptotically consistent.

Problem 2.28 (experiments with the EM algorithm)

See the MATLAB file chap_2_prob_28.m for an implementation of this problem. When that code is run we get the following output

mu_true = ; ; ;

mu_j = .06347; ; ;
s_true = ; ; ;
sigma_j = ; ; ;
p_true = ; ; ;
P_j = ; ; ;

Note that when we run the EM algorithm, and it converges to the true solution, the actual ordering of the elements in the estimated vectors holding the estimated values of µ, σ², and P_j does not have to match the ordering of the truth. Thus in the above output we explicitly permute the order of the estimated results to make the estimates line up with the true values.

Problem 2.30 (nearest neighbors classification)

See the MATLAB file chap_2_prob_30.m for an implementation of this problem. When that code is run and we classify the test samples using the nearest neighbor classifier for various numbers of nearest neighbors, we get the following probability of error output

P_e 1NN = ; P_e 2NN = ; P_e 3NN = ; P_e 4NN = ; P_e 5NN = ; P_e 6NN = ;
P_e 7NN = ; P_e 8NN = ; P_e 9NN = ; P_e 10NN = ; P_e 11NN = ; P_B =

It can be shown that

P_{3NN} ≈ P_B + 3P_B²,   (40)

and thus, using the value of P_B computed above, we can estimate P_{3NN} from this formula. The result is in the same ballpark as the empirical value obtained above.
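For reference, a minimal MATLAB k-NN classifier sketch in the spirit of the experiment above (mine, not the referenced script; the data distributions are illustrative assumptions):

  % k-NN classification of test samples by majority vote among training samples.
  Ntr = 1000;  Nte = 1000;  k = 3;
  Xtr = [randn(Ntr, 2); randn(Ntr, 2) + 2];   % training set from two Gaussian classes
  ytr = [ones(Ntr, 1); 2*ones(Ntr, 1)];
  Xte = [randn(Nte, 2); randn(Nte, 2) + 2];   % an independent test set
  yte = [ones(Nte, 1); 2*ones(Nte, 1)];

  yhat = zeros(size(yte));
  for i = 1:numel(yte)
    d = sum((Xtr - Xte(i, :)).^2, 2);         % squared distances (implicit expansion)
    [~, idx] = sort(d);
    yhat(i) = mode(ytr(idx(1:k)));            % majority vote among the k nearest
  end
  fprintf('P_e %dNN = %.4f\n', k, mean(yhat ~= yte));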

Problem 2.31 (Parzen window density estimation)

Parzen window density estimation with N(0,1) for a kernel function φ means that we take

φ(x) = (1/√(2π)) e^{−x²/2},

and then estimate our density p(x) for l dimensional feature vectors using

p̂(x) = (1/h^l) ( (1/N) Σ_{i=1}^N φ( (x_i − x)/h ) ).   (41)

In this problem l, the dimension of the feature vectors, is 1. See the MATLAB file chap_2_prob_31.m for an implementation of this problem. When that code is run we get the plots shown in Figure 1. Each plot is the density estimate based on N points, where N is 32, 256, or 5000, and for h given by 0.05 and 0.2.

Figure 1: Left: Parzen window probability density estimation with N = 32 points. Center: Parzen window probability density estimation with N = 256 points. Right: Parzen window probability density estimation with N = 5000 points. (Each panel compares the true density with the estimates for the two bandwidths h.)

Problem 2.32 (k nearest neighbor density estimation)

For k-nearest neighbor density estimation we approximate the true density p(x) using samples via

p̂(x) = k/(N V(x)),   (42)

where in this expression k and N are fixed and V(x) is the volume of a sphere around the point x such that it contains the k nearest neighbors. For 1-dimensional densities this volume is a length. Thus the procedure we implement, when given a point x where we want to evaluate the empirical density, is to:

1. Find the k nearest neighbors around the point x.

2. Compute V(x) as the length between the largest and smallest points in the above set.

See the MATLAB file chap_2_prob_32.m for an implementation of this problem. When that code is run we get the plots shown in Figure 2. Note that these density estimates seem very noisy, and this noise decreases as we take more and more neighbors.

Figure 2: k nearest neighbor density estimation with k = 32, 64, and 256 points.
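For reference, here is a compact MATLAB sketch (mine, not the referenced scripts) of both 1-D estimators above; the sample size, bandwidth, and neighbor count are illustrative assumptions.

  % Parzen window (Equation 41) and k-NN (Equation 42) density estimates in 1-D.
  N = 256;  h = 0.2;  k = 32;
  X = randn(N, 1);                            % samples from the true N(0,1) density
  xg = linspace(-4, 4, 401)';                 % evaluation grid

  p_parzen = zeros(size(xg));  p_knn = zeros(size(xg));
  for i = 1:numel(xg)
    u = (X - xg(i))/h;                        % Gaussian kernel, Equation 41 with l = 1
    p_parzen(i) = mean(exp(-u.^2/2)/sqrt(2*pi))/h;
    [~, idx] = sort(abs(X - xg(i)));          % the k nearest neighbors of xg(i)
    nb = X(idx(1:k));
    p_knn(i) = k/(N*(max(nb) - min(nb)));     % V(x) = length spanned by the neighbors
  end
  plot(xg, p_parzen, xg, p_knn, xg, exp(-xg.^2/2)/sqrt(2*pi));
  legend('Parzen', 'k-NN', 'truth');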

Linear Classifiers

Notes on the text

Notes on Linear Discriminant Functions: the derivation of d

In two dimensions our weight vector and feature vectors have two components, so w^T = (w_1, w_2), and our discriminant function g(x) is given by

g(x) = g(x_1, x_2) = w_1x_1 + w_2x_2 + w_0.

Then g(x) = 0 is a line that will intersect the x_1 axis at

x_2 = 0 ⇒ w_1x_1 + w_0 = 0, or x_1 = −w_0/w_1,

and the x_2 axis at

x_1 = 0 ⇒ w_2x_2 + w_0 = 0, or x_2 = −w_0/w_2.

Plotting a line that goes through the points (0, −w_0/w_2) and (−w_0/w_1, 0), and assuming w_0 < 0, gives Figure 3.1 presented in the book. We now want to derive the expressions for d and z given in this section. To do that, let P be the point on the decision line that is closest to the origin, and let d be the distance from the origin to this point P. Since P is the closest point to (0,0), the vector from (0,0) to P must be orthogonal to the decision line, i.e. parallel to the vector w^T = (w_1, w_2). Thus we seek to determine the value of d such that the point that is d away from (0,0) in the direction of w, i.e.

dŵ = d( w_1/√(w_1² + w_2²), w_2/√(w_1² + w_2²) ),

is on the discriminant line g(x_1, x_2) = 0. When we put the above two components for this point into g(x_1, x_2) = 0 we get

d w_1²/√(w_1² + w_2²) + d w_2²/√(w_1² + w_2²) + w_0 = 0.

When we solve for d in the above expression we get

d = −w_0/√(w_1² + w_2²).

Since we are assuming that w_0 < 0, we see that d can also be written as

d = |w_0|/√(w_1² + w_2²).   (43)

We can also obtain this formula using similar triangles. Note that the big triangle with vertices (0,0), (−w_0/w_1, 0), and (0, −w_0/w_2) is similar to the small triangle with vertices P,

(−w_0/w_1, 0), and (0,0), in that they are both right triangles and have a common acute angle (with vertex at (−w_0/w_1, 0)). The ratio of the hypotenuse to the leg opposite the common acute angle must be equal in both triangles, or

√( (w_0/w_1)² + (w_0/w_2)² ) / |w_0/w_2| = |w_0/w_1| / d.

When we solve this expression for d we get the same expression as before:

d = |w_0/w_1| |w_0/w_2| / √( (w_0/w_1)² + (w_0/w_2)² ) = |w_0| / √(w_1² + w_2²).

Notes on Linear Discriminant Functions: the derivation of z

We now seek to derive the expression for z, the distance from any point not on the decision surface to the decision surface. Let x be a point not on the decision hyperplane, and define z to be the distance from x to the closest point on the decision line. In the earlier part of these notes we derived the expression for this distance when the point not on the decision line was the origin (0,0). Using this, it might be easier to compute the value of z if we first translate the decision line so that it passes through the origin. To do that we subtract dŵ from every point in the R² plane, where

d = |w_0|/√(w_1² + w_2²)  and  ŵ = ( w_1/√(w_1² + w_2²), w_2/√(w_1² + w_2²) ).

The new coordinates x' are then given by

x' = x − dŵ.

In the translated space, where our decision surface passes through the origin, the points x' can be decomposed into a component in the direction of ŵ and a component in a direction perpendicular to ŵ, which we denote ŵ_⊥. The vector ŵ_⊥ has components that are related to those of ŵ as

ŵ_⊥ = ( −w_2/√(w_1² + w_2²), w_1/√(w_1² + w_2²) ).

Then given these two vectors we can decompose x' as

x' = (ŵ^T x')ŵ + (ŵ_⊥^T x')ŵ_⊥.

Then the distance from a point x' to the line w^T x' = 0, i.e. z, will be the absolute value of the coefficient of the vector ŵ above, and (assuming it is positive, meaning that x' is to the right of the decision line) we find

z = ŵ^T x'.

In terms of the original variables we have z given by

z = ŵ^T(x − dŵ) = ŵ^T x − d = (w_1x_1 + w_2x_2)/√(w_1² + w_2²) − |w_0|/√(w_1² + w_2²).

If w_0 < 0 then −|w_0| = w_0 and the above becomes

z = (w_1x_1 + w_2x_2 + w_0)/√(w_1² + w_2²) = g(x)/√(w_1² + w_2²),

which is the expression we wanted to show.

Notes on the Perceptron Algorithm

If x ∈ ω_1 and x is misclassified by the perceptron algorithm then w^T x < 0, so if we take δ_x = −1 then the product δ_x(w^T x) > 0. If x ∈ ω_2 and is misclassified then the product w^T x > 0, so if we take δ_x = +1 then δ_x(w^T x) > 0. In all cases where a sample x is misclassified, each term in the perceptron cost function

J(w) = Σ_{x∈Y} δ_x w^T x,   (44)

is positive.

Notes on the Proof of the Perceptron Algorithm Convergence

We begin with the gradient descent algorithm applied to the perceptron cost function J(w) given by Equation 44. This algorithm is

w(t+1) = w(t) − ρ_t Σ_{x∈Y} δ_x x,

where Y is the set of samples misclassified by the current weight vector w(t). Since the set Y depends on the current weight vector w(t), we could indicate this with the notation Y(t) if desired. We would now like to show that this algorithm converges to a vector that is parallel to the optimal vector w*. A vector parallel to w* is any vector of the form αw*, and this is what we subtract from w(t+1) above to get

w(t+1) − αw* = w(t) − αw* − ρ_t Σ_{x∈Y} δ_x x.

Then squaring both sides of the above we get

‖w(t+1) − αw*‖² = ‖w(t) − αw*‖² + ρ_t² ‖Σ_{x∈Y} δ_x x‖² − 2ρ_t Σ_{x∈Y} δ_x (w(t) − αw*)^T x.

Since the vector w(t) is not optimal it will misclassify some sample points. Thus the perceptron cost function J(w(t)) = Σ_{x∈Y} δ_x w(t)^T x is positive, so its negative, −Σ_{x∈Y} δ_x w(t)^T x, is a negative number. This gives the upper bound on ‖w(t+1) − αw*‖² of

‖w(t+1) − αw*‖² ≤ ‖w(t) − αw*‖² + ρ_t² ‖Σ_{x∈Y} δ_x x‖² + 2ρ_t α Σ_{x∈Y} δ_x w*^T x.

Now since the expression ‖Σ_{x∈Y} δ_x x‖² depends only on the training data and not on the algorithm used to compute w(t), we can introduce β² such that

β² ≡ max_Y ‖Σ_{x∈Y} δ_x x‖²;

thus β² is the largest possible value for the given sum. At this point we have

‖w(t+1) − αw*‖² ≤ ‖w(t) − αw*‖² + ρ_t²β² + 2ρ_t α Σ_{x∈Y} δ_x w*^T x.

Since w* is a solution vector, from how x is classified and the definition of δ_x as discussed above, we know that Σ_{x∈Y} δ_x w*^T x < 0, since each term in the sum is negative. As with the introduction of β², the above sum is over training points, so we can introduce γ as

γ ≡ max_Y Σ_{x∈Y} δ_x w*^T x < 0.

For any fixed set Y we then have Σ_{x∈Y} δ_x w*^T x ≤ γ < 0, by the definition of γ. Now since γ < 0 we can write γ = −|γ| and thus have

Σ_{x∈Y} δ_x w*^T x ≤ γ = −|γ|.

Using this result we are thus able to bound ‖w(t+1) − αw*‖² as

‖w(t+1) − αw*‖² ≤ ‖w(t) − αw*‖² + ρ_t²β² − 2ρ_t α|γ|.   (45)

Up to this point we have been studying the distance between w(t) and an arbitrary vector αw* that is parallel to w*. Let's now consider how the distance between w(t) and αw* behaves when α = β²/(2|γ|). Using that value we find that

ρ_t²β² − 2ρ_t α|γ| = ρ_t²β² − ρ_tβ² = β²(ρ_t² − ρ_t),

and the bound above becomes

‖w(t+1) − αw*‖² ≤ ‖w(t) − αw*‖² + β²(ρ_t² − ρ_t).

Writing out these expressions for t = 0, 1, 2, ... gives

‖w(1) − αw*‖² ≤ ‖w(0) − αw*‖² + β²(ρ_0² − ρ_0)
‖w(2) − αw*‖² ≤ ‖w(1) − αw*‖² + β²(ρ_1² − ρ_1)
              ≤ ‖w(0) − αw*‖² + β²(ρ_0² − ρ_0) + β²(ρ_1² − ρ_1)
              = ‖w(0) − αw*‖² + β²( Σ_{k=0}^1 ρ_k² − Σ_{k=0}^1 ρ_k )
‖w(3) − αw*‖² ≤ ‖w(2) − αw*‖² + β²(ρ_2² − ρ_2)
              ≤ ‖w(0) − αw*‖² + β²( Σ_{k=0}^2 ρ_k² − Σ_{k=0}^2 ρ_k )
...
‖w(t+1) − αw*‖² ≤ ‖w(0) − αw*‖² + β²( Σ_{k=0}^t ρ_k² − Σ_{k=0}^t ρ_k ).   (46)

Notes on the Pocket algorithm

See the MATLAB script pocket_algorithm.m for code that implements the Pocket Algorithm. An example decision surface when this script is run is given in Figure 3.

Figure 3: The decision surface produced when running the script pocket_algorithm.m. Note that this data is not linearly separable.
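For completeness, a minimal pocket-algorithm sketch (mine, not the referenced script); the overlapping data and the iteration budget are illustrative assumptions.

  % Pocket algorithm: run a perceptron on non-separable data, but keep ("pocket")
  % the weight vector with the fewest training errors seen so far.
  N = 100;  rho = 1;  T = 2000;
  X = [randn(N, 2) - 0.75; randn(N, 2) + 0.75];   % two overlapping classes
  y = [-ones(N, 1); ones(N, 1)];
  Xa = [X, ones(2*N, 1)];                         % augmented features [x1 x2 1]

  w = zeros(3, 1);  best_w = w;  best_err = inf;
  for t = 1:T
    i = randi(2*N);                               % visit samples in random order
    if y(i)*(Xa(i, :)*w) <= 0
      w = w + rho*y(i)*Xa(i, :)';                 % perceptron update on a mistake
    end
    err = sum(y.*(Xa*w) <= 0);                    % training errors of the current w
    if err < best_err
      best_err = err;  best_w = w;                % better than the pocketed w: swap it in
    end
  end
  fprintf('pocket w misclassifies %d of %d points\n', best_err, 2*N);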

Duplication of Example 3.2 (Kesler's construction)

See the MATLAB script dup_example_3_2.m for code that duplicates the numerical results from this example. When this code is run it produces a plot like that shown in Figure 4.

Figure 4: The three decision lines produced when running the script dup_example_3_2.m.

The previous command also verifies that w_j^T x_i > w_k^T x_i for k ≠ j when x_i ∈ ω_j, by computing these inner products. We find (several of the printed values were lost to transcription)

class = 1; w1^T x = .4449; w2^T x = ; w3^T x =
class = 1; w1^T x = .709;  w2^T x = ; w3^T x =
class = 1; w1^T x = 6.638; w2^T x = -.33; w3^T x =
class = 2; w1^T x = ; w2^T x = 6.68;  w3^T x =
class = 2; w1^T x = ; w2^T x = .934;  w3^T x =
class = 2; w1^T x = ; w2^T x = 4.377; w3^T x =
class = 3; w1^T x = .0070; w2^T x = -8.86; w3^T x =
class = 3; w1^T x = ; w2^T x = ; w3^T x = 0.77
class = 3; w1^T x = -.70; w2^T x = ; w3^T x =

verifying that indeed we are correctly classifying all of the training data.

Notes on Mean Square Error Estimation

Consider the objective function J(w) given by

J(w) = E[ |y − x^T w|² ],

where the expectation in the above is taken with respect to the joint density of (X, Y). To evaluate this we will use the idea of iterated expectation, where

E[X] = E[ E[X|Y] ].

Since the Y variable is discrete, Y = ±1, and once Y is chosen the x density is conditioned on the value of y, we have that J(w) is given by

J(w) = E[ |y − x^T w|² ]
 = E[ |y − x^T w|² | Y = −1 ] P(Y = −1) + E[ |y − x^T w|² | Y = +1 ] P(Y = +1)
 = E[ |1 + x^T w|² | Y = −1 ] P(ω_1) + E[ |1 − x^T w|² | Y = +1 ] P(ω_2)
 = P(ω_1) ∫ (1 + x^T w)² p(x|ω_1) dx + P(ω_2) ∫ (1 − x^T w)² p(x|ω_2) dx.   (47)

Notes on the Multiclass Generalization

The text states "let's define y = [y_1, ..., y_M]^T for a given vector x" but does not explicitly say how to define it. If we assume that the sample x ∈ ω_j, then the vector y should be all zeros except for a single one in the jth position. This procedure of encoding the class label as a position in a vector is known as position encoding.

Duplication of Example 3.3

See the MATLAB script dup_example_3_3.m for code that duplicates the numerical results from this example.

Notes on support vector machines in the linearly separable case

The book discusses the motivation for the support vector machine optimization problem in the case when the points are linearly separable. This problem is to compute the parameters w and w_0 of the decision hyperplane w^T x + w_0 = 0 such that we

minimize J(w) = ½‖w‖²   (48)
subject to y_i(w^T x_i + w_0) ≥ 1 for i = 1, 2, ..., N.   (49)

This is a minimization problem with inequality constraints. The necessary conditions for its minimum are given by the Karush-Kuhn-Tucker (KKT) conditions. To introduce these we need to form the Lagrangian L given by

L(w, w_0, λ) = ½w^T w − Σ_{i=1}^N λ_i [ y_i(w^T x_i + w_0) − 1 ].   (50)

Then the KKT conditions for the above problem are:

∂L/∂w = 0 and ∂L/∂w_0 = 0.

λ_i ≥ 0 for i = 1, 2, ..., N
λ_i ( y_i(w^T x_i + w_0) − 1 ) = 0 for i = 1, 2, ..., N.

We now need to use these conditions to find the optimum. One way to do this, which might work for small problems, is to assume some of the constraints are active, i.e.

y_i(w^T x_i + w_0) − 1 = 0,

for some set of i's. This is equivalent to fixing/determining the support vectors. Then we solve the resulting equations for the remaining Lagrange multipliers λ_i and verify that they are in fact non-negative. We can also solve this optimization problem if we recognize that J(w) is a strictly convex function and apply the Wolfe dual representation to this problem. This states that the above minimization problem is equivalent to the following maximization problem

maximize L(w, w_0, λ)   (51)
subject to ∂L/∂w = 0 and ∂L/∂w_0 = 0   (52)
and λ_i ≥ 0 for i = 1, 2, ..., N.   (53)

To use this representation we take the form for L(w, w_0, λ) given above and find that ∂L/∂w = 0 gives

∂L/∂w = w − Σ_{i=1}^N λ_i y_i x_i = 0,   (54)

while ∂L/∂w_0 = 0 is

∂L/∂w_0 = Σ_{i=1}^N λ_i y_i = 0.   (55)

Putting these two constraints, given by Equations 54 and 55, into Equation 50 we find

L = ½( Σ_{i=1}^N λ_i y_i x_i )^T ( Σ_{i=1}^N λ_i y_i x_i ) − Σ_{i=1}^N λ_i y_i [ Σ_{j=1}^N λ_j y_j x_j^T x_i + w_0 ] + Σ_{i=1}^N λ_i
 = ½ Σ_{i=1}^N Σ_{j=1}^N λ_i λ_j y_i y_j x_i^T x_j − Σ_{i=1}^N Σ_{j=1}^N λ_i λ_j y_i y_j x_i^T x_j − w_0 Σ_{i=1}^N λ_i y_i + Σ_{i=1}^N λ_i
 = −½ Σ_{i=1}^N Σ_{j=1}^N λ_i λ_j y_i y_j x_i^T x_j + Σ_{i=1}^N λ_i,   (56)

for L. We want to maximize the above expression under the two constraints on λ_i that Σ_{i=1}^N λ_i y_i = 0 (Equation 55) with λ_i ≥ 0 (Equation 53). This can be done using quadratic programming, and is an easier problem because the complicated inequality constraints from Equation 49 have now been replaced with the equality constraint via Equation 55. Once we have computed λ_i with a quadratic programming algorithm, we then compute w using w = Σ_{i=1}^N λ_i y_i x_i. The computation of w_0 is done by averaging the complementary slackness conditions, or

λ_i [ y_i(w^T x_i + w_0) − 1 ] = 0 for i = 1, 2, ..., N.
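As a concrete illustration, the dual in Equation 56 can be handed to a quadratic programming routine directly. The sketch below (mine) uses quadprog from MATLAB's Optimization Toolbox (assumed available) on illustrative, well-separated data.

  % Solve the dual of Equations 48-49: max sum(L) - (1/2) L'HL with H_ij = y_i y_j x_i'x_j,
  % subject to y'L = 0 and L >= 0 (Equations 53 and 55).
  N = 40;
  X = [randn(N, 2) - 2; randn(N, 2) + 2];    % two well separated classes (assumed separable)
  y = [-ones(N, 1); ones(N, 1)];

  H = (y*y').*(X*X');
  lambda = quadprog(H + 1e-8*eye(2*N), -ones(2*N, 1), [], [], y', 0, zeros(2*N, 1), []);

  w = X'*(lambda.*y);                        % Equation 54
  sv = find(lambda > 1e-5);                  % support vectors have lambda_i > 0
  w0 = mean(y(sv) - X(sv, :)*w);             % average of y_i(w'x_i + w0) = 1 on support vectors
  fprintf('%d support vectors, margin = %.4f\n', numel(sv), 2/norm(w));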

42 Notes on support vector machines in the non-separable case When the data points are non-separable the optimization problem in the separable case changes to compute the parameters, w and w 0 of the decision hyperplane w T x + w 0 = 0 such that minimize J(w,w 0,ξ) w +C ξ i (57) subject to y i (w T x i +w 0 ) ξ i for i =,,,N (58) and ξ i 0 for i =,,,N. (59) This is again a minimization problem with inequality constraints. We form the Lagrangian L given by L(w,w 0,ξ;λ,µ) = wt w+c µ i ξ i ξ i λ i [y i (w T x i +w 0 ) +ξ i ]. The the necessary KKT conditions for this Lagrangian are that L w = 0 or while L w 0 = 0 is N L w = w λ i y i x i = 0, (60) L w 0 = λ i y i = 0, (6) while L ξ i = 0 is L = C µ i λ i = 0, (6) ξ i for i =,,,N. Plus the conditions that the Lagrange multipliers are non-negative λ i 0, µ i 0 for i =,,,N and the complementary slackness conditions: λ i (y i (w T w i +w 0 ) ξ i ) = 0 µ i ξ i = 0 for i =,,,N. To solve this we can again apply the Wolfe dual representation which states that the above minimization problem is equivalent to the maximization problem maximize L(w,w 0,ξ;λ,µ) (63) subject to L w = 0, L = 0 and L w 0 ξ = 0 or w = λ i y i x i, λ i y i = 0 and C µ i λ i = 0, and λ i 0 and µ i 0 for i =,,,N. (64) 4

43 Using these constraints into the definition of L we find ( L = N ) T ( N λ i y i x i λ i y i x i )+C ξ i (C λ i )ξ i ( N ) N λ i y i λ j y j x T j x i w 0 λ i y i + λ i λ i ξ i = j= λ i λ j y i y j x T i x j + j= λ i, (65) which is another quadratic programming problem to be maximized over the vectors of Lagrangianmultipliers λ i andµ i. Sincetheelements µ i don tappearinthewolfedualobjective function above the maximization takes place over the variable λ i alone just as in the separable case above. What changes however is to recognize that the constraints C µ i λ i = 0, or µ i = C λ i means that µ i 0 implies C λ i 0 or the λ i constraint of λ i C. Problem Solutions Problem 3. (the perceptron cost function is continuous) That the perceptron objective function J(w) = x Y δ x w T x, where Y is the set of points misclassified by the weight vector w, is continuous as a function of w can be argued as follows. If changing w does not change the set Y of misclassified samples then we can write J(w) as J(w) = ( T δ x x T w = δ x x) w = α T w, x Y or the product of a fixed vector, defined here to be α and the weight vector w. This is a continuous function. If changing w causes a point x go in or out of the misclassified set, Y, then aroundthevalue of w that causes thischange J(w) will change by ±δ x (w T x). The point to note is that for a point x to go from correctly classified to incorrectly classified means that w T x must pass thought zero, since for one sign w T x will classify the point correctly while for the other sign it will not. The fact that this additional term ±δ x (w T x) is continuous as a function of w imply that the full objective function J(w) is continuous. x Y Problem 3. (if ρ k = ρ the perceptron converges) From the notes on Page 36, namely Equation 45 when ρ k does not depend on k we get w(t+) αw w(t) αw +ρ β ρα γ. (66) 43

44 Lets observe how the distance between w(t) and αw changes when α = β γ (note this is a factor of two larger than the convergence result studied on Page 36). For this value of α we find that w(t+) αw w(t) αw +β (ρ ρ). Iterating this as before gives w(t) αw w(0) αw +β (ρ ρ)t. Thus if ρ ρ < 0 then as t increases in magnitude eventually the right-hand-side of the above becomes negative or w(0) αw +β (ρ ρ)t < 0. (67) The inequality ρ ρ < 0 requires that ρ(ρ ) < 0 or that 0 < ρ <. The step k 0 where this above inequality Equation 67 first happens and we are guaranteed convergence when t k 0 w(0) αw β ρ( ρ). (68) Problem 3.3 (the reward and punishment form of the perceptron) The formofthereward and punishment form of the perceptronistostart withaninitial guess w(0) at the separating hyperplane vector w and some ordering of the input data vectors, and then iterate w(t+) = w(t)+ρx (t) if x (t) ω and w T (t)x (t) 0 w(t+) = w(t) ρx (t) if x (t) ω and w T (t)x (t) 0 w(t+) = w(t) otherwise, (69) where x (t) is the data sample to consider on the tth step. Now consider how this algorithm might approach a multiple of the optimal vector or αw. Since only the first two equations above are relevant by subtracting αw from both sides and squaring we get w(t+) αw = w(t) αw +ρ x (t) +ρx T (t) (w(t) αw ) if x (t) ω and w T (t)x (t) 0 w(t+) αw = w(t) αw ρ x (t) ρx T (t) (w(t) αw ) if x (t) ω and w T (t)x (t) 0. Let β max x (t), then the equation above become inequalities w(t+) αw w(t) αw +ρ β +ρx T (t) (w(t) αw ) if x (t) ω and w T (t)x (t) 0 w(t+) αw w(t) αw ρ β ρx T (t)(w(t) αw ) if x (t) ω and w T (t)x (t) 0. 44

45 In the first equation, since w is a true solution, and x (t) ω we have that both terms ρx T (t)w(t) and αρx T (t)w, are negative. In the second equation, again since w is a true solution, and x (t) ω both terms ρx T (t) w(t) and αρxt (t) w, are negative. Since the terms with w(t) depend on the past iterates (and the starting conditions of the algorithm) we will drop ρx T (t) w(t) from the first and ρxt (t) w(t) from the second. That means that we now have w(t+) αw w(t) αw +ρ β ραx T (t)w if x (t) ω and w T (t)x (t) 0 w(t+) αw w(t) αw ρ β +ραx T (t) w if x (t) ω and w T (t)x (t) 0. From the above discussion we know x T (t) w and x T (t) w are negative so lets take the largest possible values for x T (t) w and x T (t) w via γ = max x ω ( x T (t)w ) < 0 γ = max x ω (x T (t) w ) < 0 γ = max(γ,γ ) < 0. Then with the parameter γ both update lines collapse to w(t+) αw w(t) αw +ρ β +ραγ. Since γ < 0 we can write it as γ = γ and we get w(t+) αw w(t) αw +ρ β ρα γ. This is the same as Equation 66 and the same arguments as above show that w(t) αw as t proving the convergence of the reward and punishment form of the perceptron algorithm. In fact the arguments above show that we converge in a finite number of steps i.e. when t k 0 where k 0 is given by Equation 68. Problem 3.4 (iterating the perceptron algorithm by hand) This problem is worked in the MATLAB chap 3 prob 4.m, and when it is run produces the plot shown in Figure 5. While the reward and punishment form of the perceptron is processing the weight w(t) iterates through the following points w_i=[ 0.0, 0.0, 0.0] w_i=[ 0.0, 0.0,.0] w_i=[ -.0, 0.0, 0.0] w_i=[ -.0, 0.0,.0] w_i=[ -.0, 0.0, 0.0] w_i=[ -.0, 0.0,.0] 45

46 reward punishment perceptron computed decision line class class Figure 5: The decision line produced when running the script chap 3 prob 4.m. The classification rule is then w T x > 0 x ω, and x ω otherwise. For the final decision line where w = [ 0 ] this corresponds to a classification rule where x < x ω. Problem 3.5 (implementing the perceptron) This problem is worked in the MATLAB chap 3 prob 5.m. When this script is run we get the result shown in Figure 6 (left). Problem 3.6 (implementing the LMS algorithm) This problem is worked in the MATLAB chap 3 prob 6.m. When this script is run we get the result shown in Figure 6 (center). Problem 3.7 (Kesler s construction) Consider applying the reward and punishment form of the perceptron to the data set obtained by using Kesler s construction. In that case we order the Kesler s expanded data samples x (t) and present them one at a time to the following algorithm w(t+) = w(t)+ρ x (t) if x (t) is misclassified by w(t) i.e. w T (t) x (t) 0 w(t+) = w(t) otherwise. 46

47 3.5 perceptron computed decision line class class 3.5 LMS computed decision line class class.5 sum of square error computed decision line class class Figure 6: Data from two two-dimensional normals with mean vectors µ = [,] T, µ = [0,0] T, and σ = σ = 0.. Left: The decision region produced by the perceptron algorithm. Center: The decision region produced by the LMS algorithm. Right: The decision region produced by the sum of square error criterion. Visually, all three algorithms produce similar results. Since we are only trying to enforce w T (t) x (t) > 0 and x (t) is the Kesler s extended vector where the unexpanded vector x (t) is an element of ω i. Since w(t) in Kesler s construction is defined as w = [w T,wT,,wT M ]T and since x (t) has a +x (t) at the block spot i and a x (t) at the block spot j and zeros elsewhere the equation w(t+) = w(t)+ρ x (t), will be executed/triggered if w T (t) x (t) 0 or when or w T i x (t) w T j x (j) 0, w T i x (t) w T j w (t), and will update the block components of w(t) as as we were to show. w i (t+) = w i (t)+ρx (t) w j (t+) = w j (t) ρx (t) w k (t+) = w k (t) k i,j, 47

48 Problem 3.8 (the sum of square error optimal weight tends to MSE) Recall that the objective functions for sum of square errors (SSE) and the mean square error (MSE) criteria are J SSE (w) = (y i x T i w) J MSE (w) = E[(y x T w) ], respectively. Since the objective function J SSE (w) is a multiple of a sample based estimate of the expectation we expect that as long as the sample estimate converge to the population values we expect the SSE weight should converge to the MSE weight. Since we can write the solution vector w under the SSE criterion (by taking the derivative of the above expression with respect to w) as x i (y i x T i ŵ) = 0, by dividing by N we get that this is equivalent to ( ) x i y i x i x T i ŵ = 0. N N which as we take the limit N and assuming sample convergence to the population values becomes E[xy] E[xx T ]ŵ = 0, which is the equation that the MSE solution ŵ MSE must satisfy. Problem 3.9 (the sum of square error classifier) This problem is worked in the MATLAB chap 3 prob 9.m. When this script is run we get the result shown in Figure 6 (right). Problem 3.0 (the multiclass sum of square error criterion) If we have a M classes problem then the sum of error squares estimation would be formalized by introducing the vector y i = [y,y,...,y M ] T, where y j would take the value one, if the ith sample, x i, was a member of class ω j and would be zero otherwise. Then we introduce M vector weights w j in the matrix W defined as W = [w,w,...,w M ]. Thus the jth column of W is the vector weights w j. To specify these weights we seek to minimize the square error 48

49 over all of our training samples or J(W) = = = y i W T x i M (y ij w T j x i ) j= ( M N ) (y ij w T j x i ). j= Since this is the sum of M terms we can minimize the entire expression by picking w j for each j =,,...,M to minimize the inner summation. Thus we are doing M parallel one dimensional problems. The jth problem find the value of w j by associating to the target, y i, a value of one if x i ω j and zero otherwise. Problem 3. (jointly Gaussian random variables) For this problem we are told that the joint distribution of the pair (X,Y) is given by p X,Y (x,y) = πσ x σ y α { [ (x µx ) ( ) ]} y µy exp + α(x µ x)(y µ y ). α( α) σ x σ y σ x σ y From the expression for P X,Y (x,y) it can be shown [0], that the conditional distribution p{x Y} is given by ( p{x Y} = N µ x +α σ ) x (y µ y ),σx σ ( α ). y The book asks for the distribution of P{Y X} which can be obtained from the above by exchanging x and y in the expression above. Problem 3. (sum of error square criterion is the same as ML) If we have M classes, the statement that the classifier outputs vary around the corresponding desired response values d i k, according to a Gaussian distribution with a known variance, means that we would expect that g(x i ;w k ) N(d i k,σ ) for k =,, M. In that case if we group all of the M response for a given sample x i into a target vector like d i d i y i.. d i M 49

50 The distribution of y i is a multivariate Gaussian with a mean given by y i, and a covariance matrix given by σ I. Thus the distribution of y i is given by g(x i ;w ) g(x i ;w ). N(y i,σ I). g(x i ;w M ) Thus we expect from how y i is defined that g(x i ;w ) g(x i ;w ). g(x i ;w M ) d i d i. d i M N(0,σ I). Thus the loglikelihood L of N samples x i would then be given by N L({x i }) = ln exp (π) l σl = ln = ln ( N ( (π) l σ exp (π) ln σln g(x i ;w ) d i T g(x i ;w ) d i g(x i ;w ) d i g(x i ;w ) d i σ.. g(x i ;w M ) d i M g(x i ;w M ) d i M { [ exp M ]}) (g(x l σ i ;w k ) d k i ) k= { }) M (g(x σ i ;w k ) d k i). k= Since we will want to maximize the loglikelihood as a function of the vectors w k we can drop all things that don t depend on w and minimize the negative of what remains. We then need to minimize as a function of w k for k =,,,M the following expression L ({x i }) = M (g(x i ;w k ) d k i ). k= This is exactly the sum of squares error criterion in the multiclass case. Problem 3.3 (approximating P(ω x) P(ω x) in a MSE sense) The MSE training procedure (in the two class case) computes the decision surface f(x;w) weights w via ŵ = argmin w E [ (f(x;w) y) ] = argmin w j= (f(x;w) y) P(x,ω j )dx. 50

51 Using p(x,ω j ) = P(ω j x)p(x) and if we code each classes response as y = + when x ω and y = when x ω then the argument inside the integration in the above minimization becomes (f(x;w) ) P(ω x)+(f(x;w)+) P(ω x) = (f(x;w) f(x;w)+)p(ω x)+(f(x;w) +f(x;w)+)p(ω x) = f(x;w) f(x;w)[p(ω x) P(ω x)]+ = f(x;w) f(x;w)[p(ω x) P(ω x)] + [P(ω x) P(ω x)] [P(ω x) P(ω x)] + = [f(x;w) (P(ω x) P(ω x))] [P(ω x) P(ω x)] +, where we have used the fact that P(ω x)+p(ω x) =. Note that the last two terms above namely [P(ω x) P(ω x)] + do not depend on the the vector w and thus don t change the results of the minimization. Thus we are left with ŵ = argmin w (f(x;w) {P(ω x) P(ω x)}) p(x)dx = argmin w E [ (f(x;w) {P(ω x) P(ω x)}) ]. Thus we are picking w to have f(x;w) approximate the decision boundary P(ω x) P(ω x) optimally in the mean square error sense. Problem 3.4 (how the Bayesian and MSE classifier differ) The Bayesian classifier in the equal class covariance case is given by Equation 9. Thus we need to compute the MSE hyperplane we first extend each feature vector by adding the value of to get new vectors v defined by v = [x T,] T. The vector v is now of dimension l +. Then in this augmented space the optimal MSE linear classifier is given by computing ŵ T v. Then if ŵ T v > 0 we declare v ω and otherwise declare v ω. ŵ = Rv E[vy]. (70) [ ] w where ŵ is the augmented vector so that it includes an intercept w 0 and w 0 R x = E[vv T ] = E[v v ] E[v v l+ ] E[v v ] E[v v l+ ].. E[v l+ v ] E[v l+ v l+ ]. (7) We cannow evaluate several of these expectations. We have that vv T interms of theoriginal x is given by [ ] x [ vv T = x T ] [ ] xx T x = x T. 5

52 Taking the expectation of this gives [ ] R E[x] E[x] T. [ ] [ x E[xy] Next we compute the expectation of vy = y or E[vy] = E[y] is y = + when x ω and y = when x ω we can compute ]. Since the response E[y] = P(ω ) P(ω ) = 0, when P(ω ) = P(ω ) =. Next we find E[xy] under the same condition as E[xy] = E[x y = +]P(ω ) E[x y = ]P(ω ) = µ P(ω ) µ P(ω ) = µ µ. For completeness (and to be used later) we now compute E[x] and find E[x] = E[x x ω ]P(ω )+E[x x ω ]P(ω ) = (µ +µ ). From these expressions we see that we need to solve R v ŵ = E[vy] or [ ][ ] [ R E[x] w E[x] T = (µ ] µ ), 0 for ŵ. From the second of these equations we can solve for w 0 as w 0 E[x] T w+w 0 = 0 w 0 = E[x] T w, (7) which we can put in the first of the equations above to get (R E[x]E[x] T )w = (µ µ ). To further evaluate this note that by expanding the quadratic and distributing the expectation we can show that Σ E[(x E[x])(x E[x]) T ] Thus from Equation 73 we see that w is given by = R E[x]E[x] T (73) w = Σ (µ µ ). When we put that expression into Equation 7 and use what we have computed for E[x] we get w 0 = 4 (µ +µ ) T Σ (µ µ ). Thus the MSE decision line is to classify a point x as a member of ω if x T w +w 0 > 0 or xt Σ (µ µ ) 4 (µ +µ ) T Σ (µ µ ) > 0, 5

53 4 3 g (x)=0 (+,, ) + class class class 3 (+,,+) (+,+, ) 0 g (x)=0 (,, ) + (,,+) 3 (,+, ) g 3 (x)=0 (,+,+) Figure 7: The decision surface produced when running the script chap 3 prob 5.m. or ( x (µ +µ )) T Σ (µ µ ) > 0. or writing it like Equation 9 we have (µ µ ) T Σ x > (µ µ ). Note that this equation is only different from Equation 9 with regard to the right-hand-side threshold. Problem 3.5 (the design of M hyperplanes) AnexamplelikeonerequestedisproducedviarunningtheMATLABscriptchap 3 prob 5.m. When this script is run we get the result shown in Figure 7. Note that no data exists in the region where the three discriminant functions are negative which is denoted (,, ). Also regions with discriminant signs like (+, +, ) exist where more than one discriminant function is positive. Problem 3.6 (using the KKT conditions) To use the Karush-Kuhn-Tucker (KKT) conditions for the given data points in Example 3.4 we assume that the points are linearly separable and use the problem formulation given in 53

54 on Page 40. Thus we start from the formulation minimize J(w) = w subject to y i (w T x i +w 0 ) for i =,,,N. The specific Lagrangian for this problem is L(w,w 0,λ) = wt w λ i (y i (w T x i +w 0 ) ) = (w +w ) λ (w +w +w 0 ) λ (w w +w 0 ) λ 3 (w w w 0 ) λ 4 (w +w w 0 ), since y i = + when x i ω and y i = when x i ω. The necessary Karush-Kuhn-Tucker conditions are then given by L w = 0 w λ λ λ 3 λ 4 = 0 (74) L w = 0 w λ +λ +λ 3 λ 4 = 0 (75) L w 0 = 0 λ λ +λ 3 +λ 4 = 0. (76) The complementary slackness conditions for the four points are given by λ (w +w +w 0 ) = 0 λ (w w +w 0 ) = 0 λ 3 ( ( w +w +w 0 ) ) = 0 λ 4 ( ( w w +w 0 ) ) = 0, with λ,λ,λ 3,λ 4 0. If we want to consider searching only for lines that pass thought the origin we take w 0 = 0 and the equations above simplify to w = λ +λ +λ 3 +λ 4 w = λ λ λ 3 +λ 4 λ λ +λ 3 +λ 4 = 0 λ (w +w ) = 0 λ (w w ) = 0 λ 3 (w w ) = 0 λ 4 (w +w ) = 0. Put the expressions just derived for w and w into the complementary slackness conditions and by changing the ordering of the equations we get λ (λ +λ 4 ) = 0 λ 4 (λ +λ 4 ) = 0 λ (λ +λ 3 ) = 0 λ 3 (λ +λ 3 ) = 0 λ λ +λ 3 +λ 4 = 0. (77) 54

55 These equations have multiple solutions, but the first two will hold true if λ +λ 4 = 0 and the second two will be true if λ +λ 3 = 0. Lets specify that these constraints are active. That is we will assume that λ +λ 4 = 0 λ = λ 4 λ +λ 3 = 0 λ = λ 3. (78) If we put these two expressions into Equation 77 we get or +λ 4 +λ 3 +λ 3 +λ 4 = 0, λ 4 +λ 3 = 0, which again has multiple solutions. If we pick λ 4 0 arbitrary then solving for λ 3 we have λ 3 = λ 4. Using Equation 78 for λ in terms of λ 3 we have λ = ( ) λ 4 = λ 4. Thus we have shown all values of λ i can be written in terms of λ 4 as λ λ λ 3 = λ 4 λ 4 λ. 4 λ 4 λ 4 Using this our weight vector w is w = = + 4 λ i y i x i ( ) λ 4 ( λ 4 [ (+) ) ( ) ] [ [ ] +λ 4 (+) ] [ +λ 4 ( ) ] = [ 0 ], the same solution quoted in the text. 55

56 Nonlinear Classifiers Notes on the text Notes on the XOR Problem x y x Figure 8: The perceptron that implements the OR gate. The perceptron that implements the OR gate is shown in Figure 8. From the given diagram we see that this perceptron gives an output value of when x +x > 0. This expression is false for the point (0,0) and is true for the points (,0), (0,), and (,). Algorithms based on exact classification of the training set We documents some notes on Meza s tiling algorithm for building a two-class neural network that exactly classifies the given input data points. In this algorithm we will be growing a network that depends on the training data rather than starting with a fixed network and then determining the parameters of the fixed network. The book does a very nice job explaining the general procedure and these are just some notes I wrote up going into more detail on the simple example given. The example points and decision lines for this section are duplicated in Figure 9. For this procedure we begin by first dividing the initial data set into two regions using one of the linear algorithms discussed in the previous chapter. For example, the pocket algorithm or a SVM like algorithm could be used to define the initial splitting of the data. This is denoted as the decision line # in Figure 9. After this line is specified, we determine that the training points are not classified with 00% certainty. We define the set X + to be the set of points on the plus side of line #. In this set there is one misclassified vector B. We define the set X as the set of points on the minus side of line #. The set X has the misclassified vector A. For any set X ± that has misclassified vectors we then recursively applytheprevious algorithmtothesesets. Intheexampleabove, wewoulddivide theset X + 56

57 3 + class + (A) class (B) + x B A x Figure 9: The sample data points and decision regions used to explain the Meza s tiling procedure for generating a network that correctly classifies the given input points. again using an algorithm that produces a separating hyperplane (like the pocket algorithm) and produce the decision line # in Figure 9. For the set X we do the same thing and produce the separating hyperplane denoted by the decision line #3 in Figure 9. The sets on the plus and minus side of X + arethen denoted as X ++ andx +. The same sets for the set X are denoted as X + and X. If any of these four new sets has misclassified vectors it will be further reduced using a separating hyperplane. This procedure is guaranteed to finish since each separating hyperplane we introduce is processing a smaller and smaller set of data. Each of these separating hyperplanes is pictorially represented as a node in the input layer horizontally located next to the master unit (or the node representing the first data split). As each hyperplane introduced is used to split misclassified data into groups with correct classification, no two data points, in different classes, can end up mapped to the same point after the first layer. Meza s algorithm next operates recursively on the set of points computed from the first layer or X where X is given by X = {y : y = f (x), for x X}, and f is the mapping implemented in the first layer. Meza has shown that this recursive procedure converges, while the book argues that since no two data points are mapped to the same vertex of a hypercube correct classification may be possible with at most three layers. Notes on backpropagation algorithm: the antisymmetric logistic For the logistic function, f(x), given by f(x) =. (79) +exp( ax) 57

58 We can show that this is equivalent to a hyperbolic tangent function with f(x) = +exp( ax) +exp( ax) +exp( ax) = exp( ax) +exp( ax) ) = eax e ax e ax +e ax = sinh(ax ) cosh( ax ) = tanh ( ax. Notes on backpropagation algorithm: the gradients This section are some simple notes that I took as I worked through the derivation of the backpropagation algorithm. Much of the discussion is of the same general form as presented in the book, but these notes helped me understand this material so I wrote them up so that they might help others. We will assume that we have a L layer network where L is given to us and fixed and we want to learn the values of the parameters that will give us the smallest mean square error (MSE) over all of the training samples. The input layer is denoted as the layer 0 and the output layer is denoted as layer L. The notation k r, for r = 0,,, L, will denote the number of nodes in the rth layer so for example, we will have k 0 nodes in the first and input layer, and k L nodes in the Lth or output layer. To denote the weights that will go into the summation at the j node in the rth layer from the nodes in the r th layer we will use the notation w r j = [ w r j0, w r j,, w r jk r ] T. (80) Where r =,,,L. The value of w0j r is the constant bias input used by the jth node in the rth layer. We will update these vectors wj r iteratively to minimize the MSE using a gradient decent scheme. Thus we will use the weight update equation where w r j (new) = wr j (old)+ wr j, (8) w r j = µ J w r j, (8) and J is the cost function we seek to minimize. For the application here the cost function J we will use will be a sum of individual errors over all of the sample points in the training set of J E(i). (83) Here E(i) is the sample error we assign to the approximate output of the Lth output layer denoted ŷ m (i). We let e m (i) denote the error in the mth component of the output our network makes when attempting to classify the sample x(i) or e m (i) y m (i) ŷ m (i) for m =,,,k L. (84) Using this definition the sample error E(i) is defined as just the mean square error of the networks mth component output due to the ith sample or ŷ m (i) against the true expected 58

59 result or E(i) k L m= e m (i) = k L m= (y m (i) ŷ m (i)). (85) At this point we are almost ready to derive the backpropagation algorithm but we will need a few more definitions. In general, a neural network works by multiplying internal node outputs by weights, summing these products, and then passing this sum through the nonlinearactivationfunctionf( ). Wecanimaginethatforeveryinput-outputpairing(x i,y i ) for i =,,,N we have a value for all of the variables just introduced. Thus our notation needs to depend on the sample index i. We do this by including this index on a variable, say X, using functional notation as in X(i). To represent the other variables we will let y r k (i) represent the output the kth neuron in the r th layer (for r =,3,,L,L+). In this notation, when r = then we have yk (i) which is the output of the first hidden layer and when r = L + we have yk L (i) which is the output from the last (or output) layer. Since there are k r nodes in the r layer we have k =,,,k r. As introduced above, the scalar wjk r is the weight leading into the jth neuron of the rth layer from the kthe neuron in the r layer. Since the rth layer has k r nodes we have j =,,,k r. Note that we assume that after sufficient training the weights will converge and there is no i index in their notational specification. As the input to the activation function for node j in layer r we need to multiply these weights with the neuron outputs from layer r and sum. We denote this result vj r (i). Thus we have now defined k r vj r (i) = k= w r jk yr kr k (i)+wj0 r = k=0 w r jk yr k (i), (86) where we take y r 0 to make the incorporation of a constant bias weight w r j0 transparent. With these definitions we are now ready to derive the backpropagation algorithm for learning neural network weights wj r. Since we assume an initial random assignment of the neural network weights we can assume that we know values for all of the variables introduced thus far for every sample. We seek to use derivative information to change these initial values for wj r intoweightsthatmaketheglobalcostfunctionj smaller. Thebackpropagationprocedure starts with the weights that feed into the last L layer, namely wj L, and works backwards updating the weights between each hidden layer until it reaches the weights at the first hidden layer wj. Once all of the weights are updated we pass every sample (x i,y i ) thought the modified network (in the forward direction) and are ready to do another backpropagation pass. To use the gradient decent algorithm we will need the derivatives of the sample error function E(i) with respect to wj, r the weights that go into the j neuron in the rth layer. Converting this derivative using the chain rule into the derivative with respect to the output vj r (i) we have E(i) = E(i) vj(i) r wj r vj r(i). (87) wj r Because of the linear nature in which v r j(i) and w r j are related via Equation 86 this second 59

60 derivative is easily computed v r j(i) w r j = v r j (i) w r j0 v r j (i) w r j. v r j (i) w r jk r = y r y r 0 (i) (i). y r k r (i) = y r (i). y r k (i) r yr (i). (88) In the above notation y r (i) is the vector of all outputs from neurons in the r st layer of the network. Notice that this value is the same for all nodes j in the rth layer. Lets define the remaining derivative above or E(i) as vj r (i) δ r j E(i) (i), (89) vj r (i) for every j in the rth layer or j =,, k r. Using these results we have w r j given by. w r j = µ J w r j = µ E(i) w r j = µ δj(i)y r r (i). (90) It is these δj r (i) we will develop a recursive relationship for and the above expression will enable us to compute the derivatives needed. Recall that we define E(i) the error in the ith sample output as in Equation 85 here expressed in terms of vm L (i) as E(i) k r m= (y m (i) ŷ m (i)) = k r m= (y m (i) f(v L m (i))). (9) The output layer: In this case r = L and we are at the last layer (the output layer) and we want to evaluate δj L E(i) (i) = when E(i) is given by Equation 9. From that expression vj L (i) we see that the vj L derivative selects only the jth element from the sum and we get E(i) v L j (i) = ()(y j(i) f(v L j (i)))f (v L j (i)) where we have defined the error e j (i) as in Equation 84. = e j (i)f (vj L (i)). (9) The hidden layers: In this case r < L and the influence of v r j on E(i) comes indirectly through its influence on vj r. Thus using the chain rule to introduce this variable we have E(i) v r j (i) = k r k= E(i) v r k (i) v r k (i) v r j (i). (93) On using the definition of δ r j(i) given by Equation 89 in both side of this expression we have δ r j (i) = k r k= δ r k (i) vr k (i) v r j (i). (94) 60

61 We stop here to note that this is a expression for the previous layer s δ r j showing how to compute it given values of the current layer s δj r. To fully evaluate that we need to compute vk r(i) v r j (i). Using Equation 86 we find vk r(i) v r j (i) = j v r (i) [ kr m=0 w r km yr m (i) ] = v r j (i) [ kr m=0 ] wkm r f(vr m (i)). This derivative again selects the m = jth term in the above sum and we find vk r(i) vj r (i) = wr kjf (v r j (i)). (95) Thus the recursive propagation of δ r j(i) the then given by using Equation 94 with the above derivative where we find with ê r j defined by δ r j (i) = k r k= δ r k (i)wr kj f (v r j k r (i)) = f (v r (i)) δk r (i)wr kj (96) j = ê r j f (v r j (i)), (97) ê r j = k r k= k= δ r k(i)w r kj. (98) When our activation function f(x) is the sigmoid function defined by f(x) =, (99) +e ax we find its derivative given by [ ] f ( a) e ax +e ax (x) = = f(x)(a) = af(x) (+e ax ) e ax +e ax +e ax = af(x)( f(x)). (00) With all of these pieces we are ready to specify the backpropagation algorithm. The backpropagation algorithm Initialization: We randomly initialize all weights in our network wkm r for all layers r =,, L and for all internal nodes where the index k selects a node from the layer r thus k =,, k r and the index m selects a node from the layer r and thus m =,, k r. Forward propagation: Once the weights are assigned values for each sample (x i,y i ) for i =,, N we can evaluate using forward propagation the variables vj r (i) using Equation 86 and then yj(i) r via f(vj(i)). r We can also evaluate the individual errors E(i) using Equation 85 and the total error J using Equation 83 6

62 Then starting at the last layer where r = L for each sample i =,, N and for each neuron j =,, k L first compute δj L using Equation 9. Then working backwards compute δj r using Equation 97 and 98. Since we know δj L we can do this for r = L,L,, and for each neuron in layer r so j =,, k r. Once we have δj r computed we can now update the weights wj r using Equation 8 and 8 with the explicit for form wj r given by Equation 90. Once we have updated our weight vectors we are ready to apply the same procedure again, i.e. issuing a forward propagation followed by a backwards propagation sweep. Notes on variations on the backpropagation algorithm Consider the backpropagation weight update equation with the momentum factor αwj(t ) r given by wj r (t) = α wr j (t ) µg(t). (0) By writing out this recurrence relationship for t = T,T,T, as w r j(t) = α w r j(t ) µg(t) = α [ α w r j (T ) µg(t )] µg(t) = α wj r (T ) µ[αg(t )+g(t)] = α [ α wj r (T 3) µg(t )] µ[αg(t )+g(t)] = α 3 wj(t r 3) µ [ α g(t )+αg(t )+g(t) ]. T = α T wj(0) µ r α t g(t t). (0) t=0 As we require α < then α T wj(0) r 0 as T +. If we assume that our gradient vector is a constant across time or g(t t) g then we see that wj r (T) can be approximated as ( ) wj r (T) µ[ +α+α +α + ]g 3 = µ g. α Problem Solutions Problem 4. (a simple multilayer perceptron) The given points for this problem are plotted in the Figure 0 and is generated with the MATLABscriptchap 4 prob.m.. Fromthatfigureweseethatthesepointsareascattering of points around the XOR pattern, which we know are not linearly separable, these points are also not be linearly separable. We can however separate the two classes in the same way 6

63 + (y,y )=(,) class class.5 g (x,x )=0 x 0.5 (y,y )=(0,) g (x,x )= (y,y )=(0,0) x Figure 0: The data points from Problem 4. and the two linear discriminant functions g (x,x ) = 0 and g (x,x ) = 0, that when combined, separate the two classes. we did for the XOR pattern. If we introduce two discriminate lines g (x,x ) and g (x,x ) given by g (x,x ) = x +x.65 = 0 and g (x,x ) = x +x 0.65 = 0. Next we introduce threshold variables, y i, that are mappings of the values taken by g i when evaluated at a particular pair (x,x ). For example, y i = 0 if g i < 0 and y i = if g i 0. Then we see that the entire (x,x ) space has been mapped to one of three (y,y ) points: (0,0), (0,), and (,) depending on where the point (x,x ) falls relative to the two lines g = 0 and g = 0. Under the (y,y ) mapping just discussed, only the point (0,) is associated with the second class, while the other two points, (0,0) and (,), are associated with the first class. Given the output (y,y ) our task is now to design a linear hyperplane that will separate the point (0,) from the two points (0,0) and (,). A line that does this is y y = 0. Problem 4. (using a neural net to classify the XOR problem) See the R script chap 4 prob.r where this problem is worked. When that script is run it produces the plots shown in Figure. In that figure we see that after very few iterations the neural network is able to classify the training data almost perfectly. The error on the testing data set decreases initially and then increases as the net overfits and learns information that is not generalizable. The degree to which this is over fitting takes place does not really appear to be that great however. 63

64 x y x y 0.65 Figure : The network suggested by Problem 4.. x_ error_curve_training training error rate testing error rate x_ :max_iterations Figure : Left: Classification of the XOR problem. Right: The training and testing error rates as a function of iteration number. Problem 4.3 (multilayer perceptron based on cube vertexes) Part (a): The given decision regions for this problem are drawn in Figure 3, which is generated with the MATLAB script chap 4 prob 3.m. The vertex to which each region is mapped is specified using the notation (±,±,±), where a minus is mapped to 0 and a plus is mapped to a. From the discussion in the book a two layer neural network can classify unions of polyhedrons but not unions of unions. Thus if consider class ω to be composed of the points in the regions (,, ), (+,, ), (,+, ), and (+,+, ). While the class ω is composed of points taken from the other polyhedrons. Then the ω polyhedrons map to the four points on the (y,y,y 3 ) hypercube (0,0,0), (,0,0), (0,,0), and (,,0), while the other polyhedrons map to the upper four points of this hypercube. These two sets of points in the mapped (y,y,y 3 ) space can be separated easily with the hyperplane y 3 =. Thus we can implement the desired classifier in this case using the two-layer neural network 64

65 .5 x + x = 0 x x = 0 (,+, ) (+,+, ) (+,+,+) x x = /4 (+,, ) 0.5 (+,,+) (,, ) (,,+) x Figure 3: The given decision regions specified for Problem 4.3. shown in Figure 4. x x 0 0 y y 0.5 y Figure 4: The two layer network for Problem 4.3. Part (b): To require a three node network it is sufficient to have the mapped classes in the (y,y,y 3 ) space mapped to the XOR problem on the unit hypercube. Thus if we pick the points in the polyhedrons (,, ) and (+,+,+) to be members of class ω and the points in the other polyhedrons to be from class ω we will require a three layer network to perform classification. In that case we can use an additional layer (the second hidden layer) to further perform the classification. The resulting neural network is given in Figure 5. In that figure we have denoted the output of the two neurons in the second hidden layer as z and z. To determine the weights to put on the neurons that feed from the first hidden layer into the second hidden layer in Figure 5 since in the (y,y,y 3 ) space this is the XOR problem and we can solve it in the same way that we did in the earlier part of this chapter. That is we will create two planes, one that cuts off the vertex (0,0,0) from the other 65

66 x 0 0 y z y x /4 y 3 z 0 / 0 5/ Figure 5: The three layer network for Problem 4.3. vertexes of the hypercube and the second plane that cuts off the node (,,) from the other vertexes of the hypercube. The points in between these two planes will belong to one class and the points outside of these two planes will belong to the other class. For the first plane and the one that cuts off the vertex (0,0,0) of the many possible one plane that does this is the one that passes through the three points (/,0,0), (0,/,0), (0,0,/). While for the second plane and the one that cuts off the vertex (,,) of the many possible planes that do this one plane that works is the one that passes through the three points (,,/), (/,,), (,/,). We thus need to be able obtain the equation for a plane in three space that passes through three points. This is discussed in [9] where it is shown that the equation of a plane c x+c y +c 3 z +c 4 = 0, that must pass thought the three points (x,y,z ), (x,y,z ), and (x 3,y 3,z 3 ) is given by evaluating x y z x y z x y z = 0. x 3 y 3 z 3 As an example of this, for the first plane we need to evaluate x y z / 0 0 / 0 0 = 0, / 0 0 or x 0 0 / 0 0 / 66 y z / 0 0 / = 0,

67 or finally x+y +z = 0. The same procedure for the second plane gives x+y +z 5 = 0. Thus with the above planes we have computed the weights feeding into the second hidden layer. To finish the problem we recognized that with the first plane, only the single point (y,y,y 3 ) = (0,0,0)ismappedtothevalueof0whileallotherpointsaremappedto. With the second plane, only the single point (y,y,y 3 ) = (,,) is mapped to the value of while all other points are mapped to 0. Thus when we threshold on the sign of the values of the two discriminant mappings above we see map the points (0,0,0) (0,0), (,,) (0,), and all other (y,y,y 3 ) points are mapped to (,0). To finish our classification we need to find a hyperplane that splits the two points (0,0) and (0,) from (,0). Such a discriminant surface is z =, where we assume the second hidden layer maps the points (y,y,y 3 ) to the point (z,z ). This final discrimination surface is also represented in Figure 5. Problem 4.4 (separating the points x and x with a hyperplane) First recall that the difference vector x x is a vector from the vector x and pointing to the vector x, since if we add the vector x to this difference vector we get x i.e. x +(x x ) = x. The midpoint between the two points x and x is the vector (x +x ). Thus this problem asks to find the plane with a normal vector proportional to x x and passing through the point x 0 (x +x ). This means that if we take x to be a vector in the hyperplane then the vector x x 0 must be orthogonal to x x or have a zero dot product (x x ) T (x x 0 ) = 0. Using the definition for x 0 we have this expression is equal to or (x x ) T x (x x ) T (x +x ) = 0, (x x ) T x (xt x x T x ) = (x x ) T x x + x = 0. It remains to show that x is on the positive side of the hyperplane. To show this consider the above expression evaluated at x = x. We find (x x ) T x x + x = x x T x x + x = x x T x + x = ( x x T x + x ) = (x x ) T (x x ) = x x, which is positive showing that x is on the positive side of the above hyperplane. 67

68 Problem 4.6 (backpropagation with cross-entropy) The cross entropy 4.33 is J = k L k= Thus we see that E(i) in this case is given by E(i) = k L k= y k (i)ln y k (i)ln Thus we can evaluate δj L (i) as [ δj L E(i) (i) vj L(i) = vj L(i) k L k= ) (ŷk (i). y k (i) ) (ŷk (i). y k (i) y k (i)ln ( ) ] f(v L k ). y k (i) This derivative will select the k = jth element out of the sum and gives ( ( )) δj L (i) = y f(v L j ) j(i) vj L(i) ln = y j (i) f (vj L) y k (i) f(vj L). If the activation function f( ) is the sigmoid function Equation 99 then its derivative is given in Equation 00 where we have f (vj L) = af(vl j )( f(vl j )) and the above becomes δ L j (i) = ay j (i)( f(y L j )) = ay j (i)( ŷ j (i)). Problem 4.7 (backpropagation with softmax) The softmax activation function has its output ŷ k given by ŷ k exp(vl k ) k exp(v L k ). (03) Note that this expression depends on v L j in both the numerator and the denominator. Using the result from the previous exercise we find δ L j (i) E(i) v L j (i) = v L j (i) ( = v L j (i) = y j(i) ŷ j k L k= ( y j (i)ln ( ) ) ŷk y k (i)ln y k (i) ( )) ŷj y j (i) kl ŷ j vj L(i) k=;k j y k (i) ŷ k 68 v L j (i) ŷ k vj L(i). ( kl k=;k j ( ) ) ŷk y k (i)ln y k (i)

69 To evaluate this we first consider the first term or where we find vj L (i) ( ) ŷ j exp(v L j ) vj L(i) = vj L(i) k exp(v L k ) = ŷ j exp(vj L ) k exp(v L k ) exp(vl j )exp(vl j ) ( k exp(v L k )) = ŷ j ŷj. While for the second term we get (note that j k) ŷ k vj L(i) = ( ) exp(v L k ) = exp(vl k )exp(vl j ) vj L k exp(v L k ) ( k exp(v L k )) = ŷ kŷ j. Thus we find δ L j = y j(i) ŷ j (ŷ j ŷ j ) kl k=;k j = y j (i)( ŷ j )+ŷ j k L k=;k j y k (i) ŷ k ( ŷ k ŷ j ) y k (i). Since ŷ k (i) and y k (i) are probabilities of class membership we have k L k= y k (i) =, and thus k L k=;k j y k (i) = y j (i). Using this we find for δj L (i) that the expression we were to show. δ L j (i) = y j(i)+y j (i)ŷ j +ŷ j ( y j ) = ŷ j y j (i), Problem 4.9 (the maximum number of polyhedral regions) The books equation 4.37 is M = l m=0 ( k m ) with ( k m ) = 0 if m > k. (04) where M is the maximum number of polyhedral regions possible for a neural network with one hidden layer containing k neurons and an input feature dimension of l. Assuming that l k then l ( ) k k ( ) k M = = = k, m m m=0 m=0 ( ) k where we have used the fact that = 0 to drop all terms in the sum when m = m k +,k +,,l if there are any. That the sum of the binomial coefficients sums to k follows from expanding (+) k using the binomial theorem. 69

70 Problem 4. (an approximation of J ) wkj r wr k j For J given by J = k L m= (ŷ m (i) y m (i)). We will evaluate the second derivatives of this expression. First we take the wkj r derivative of J directly and find J k L = (ŷ m (i) y m (i)) ŷ m(i). w r kj m= Next we take the wk r j derivative of this expression. We find w r kj J = wkj r wr k j + k L m= k L m= ŷ m (i) ŷ m (i) wk r j w r kj (ŷ m (i) y m (i)) ŷ m (i) wkj r wr k j. If we are near a minimum of the objective function we can assume that ŷ m (i) y m (i) 0 and can thus approximate the above derivative as J = wkj r wr k j k L m= ŷ m (i) ŷ m (i), wkj r wk r j showing that we can approximate the second derivative by products of the first order ones. Recall that the variable wkj r represent the weight from neuron k in layer r to the neuron j in layer r. We would expect that the effect of changes in wkj r on the output ŷ m(i) would be propagated though the variables vj r (i). From the chain rule we have ŷ m (i) w r kj = ŷ m(i) vj r vj r(i). wkj r Using Equation 88 we see that vr j w r kj = y r k and thus we get ŷ m (i) w r kj = ŷ m(i) vj r(i)yr k, which if we define ŷm(i) v r j (i) ˆδ r jm is the expression we wanted to show. Problem 4.3 (approximating the Hessian) We will assume that the weight notation for this problem is the same as in the book where by the expression wjk r is the weight from the neuron k in the r st layer into the neuron j in the r-th layer. 70

71 Using Equation 87 and Equation 88 from the chain rule we have (dropping the i dependence) ( ) E ( w = vr j E y r jk r ) wjk r vj r vj r k. Since the input y r k Using Equation 9 we get Now to evaluate E Thus the v r j is independent of v r j we get E = ( wjk r (yr k ) E. (05) ) ( vj r ) E ( v L j ) = f (v L j )e j ( v r j ) [ v L j = f (v L j )e j +f (v L j ). (y j f(v L j )) ]f (v L j ) we note that from Equation 96 and Equation 89 we have [ kr ] E vj r = δkw r kj r f (vj), r k= derivative of this is given by E ( v r j ) = f (v r k r j ) wkj r k= δk r v r j δk r v r j +f (v r j ) ] δkw r kj r. Thus we need to evaluate. To do this we will use the definition of δk r given by Equation 89, an expression like Equation 93 and subsequent developments following that equation, namely ( ) E k r ( ) E v r vj r = k vk r vk r v r k vj r k = k r j ) wk r j k = = f (v r [ kr k= E. vk r v r k Dropping all off-diagonal terms in the summation above we keep only the k = k element and find δk r v r = f (v r j )w r E kj j ( v. k r) Using this we finally get for E the following ( v r ) E ( v r j ) = (f (v r j k r j )) k= (w r kj ) E ( v r k ) +f (v r j ) [ kr k= δ r k wr kj Note that this expression is different than that given in the book in that the first term in the book has a summation with an argument of E (note the j subscript) rather ( vj r) than E (with a k subscript). Since the first of these two expressions is independent ( vk r) of k it could be taken out of the summation making me think the book has a typo in its equation. Please contact me if anyone sees any errors in this derivation. 7 ].

72 Problem 4.5 (a different activation function) When one changes the activation function in the backpropagation algorithm what changes is the function we use to evaluate any expression with f( ) or f ( ), for example in Equations 9 and 97. One of the nice things about the backpropagation algorithm is that calls to the activation function f and its derivative f can simply be viewed as algorithmic subroutines that can be replaced and modified if needed. For the suggested hyperbolic tangent function f(x) given by f(x) = ctanh(bx), (06) we have its derivative given by f (x) = cbsech (bx). From the identity cosh (x) sinh (x) =, by dividing by cosh (x) we can conclude that sech (x) = tanh (x) and thus f (x) = cb( tanh (bx)) ) = b f(x). (07) c These two functions then need to be implemented to use this activation function. Problem 4.6 (an iteration dependent learning parameter µ) A Taylor expansion of + t t 0 or + t t + t +. t 0 t 0 t 0 Thus when t t 0 the fraction to leading order and thus µ µ + t 0. On the other t 0 hand when t t 0 we have that + t t 0 t t 0 and the fraction above is given by + t t 0 t t 0 = t 0 t. Thus in this stage of the iterations µ(t) decreases inversely in proportion to t. Problem 4.7 (using a neural net as a function approximation) Thisproblemisworked intherscriptchap 4 prob 7.R. Whenthatscript isrunitproduces a plot like that shown in Figure 6. The neural network with two hidden nodes was created using the nnet command from the nnet package. We see that the neural network in this case does a very good job approximating the true function. 7

73 y training data testing data truth x Figure 6: The function to fit and its neural network approximation for Problem 4.7. Problem 4.9 (when N = (l+) the number of dichotomies is N ) We have O(N,l) = l i=0 ( N i ), (08) where N is the number of points embedded in a space of dimension l and O(N,l) is the number of groupings that can be formed by hyperplanes in R l to separate the points into two classes. If N = (l+) then O((l+),l) = l i=0 ( l + i ). Given the identity ( n+ n i+ ) = ( n+ n+i ), (09) 73

74 by taking i = n+,n,n,, we get the following equivalences ( ) ( ) n+ n+ = 0 n+ ( ) ( ) n+ n+ = n ( ) ( ) n+ n+ = n ( n+ n ( n+ n ) ). = = ( n+ n+ ( n+ n+ ) ) Now write O((l+),l) as l i=0 ( l+ i ) + l i=0 ( l+ i or two sums of the same thing. Next note that using the above identities we can write the second sum as l ( ) ( ) ( ) ( ) ( ) l+ l + l+ l + l+ = i 0 l l i=0 ( ) ( ) ( ) ( ) l + l+ l + l+ = l + l l+ l+ l+ ( ) l+ =. i i=l+ Thus using this expression we have that O((l+),l) = l i=0 ( l+ i ) + l+ i=l+ ( l+ i ), ) l+ ( l+ = i Since l+ = N we have that O((l+),l) = N as we were to show. i=0 ) = l+. Problem 4. (the kernel trick) From the given mapping φ(x) we have that yi T y j = φ(x i ) T φ(x j ) = +cos(x i)cos(x j )+cos(x i )cos(x j )+ +cos(kx i )cos(kx j ) + sin(x i )sin(x j )+sin(x i )sin(x j )+ +sin(kx i )sin(kx j ). 74

75 Since cos(α)cos(β)+sin(α)sin(β) = cos(α β) we can match cosigns with sines in the above expression and simplify a bit to get y T i y j = +cos(x i x j )+cos((x i x j ))+ +cos(k(x i x j )). To evaluate this sum we note that by writing the cosigns above in terms of their exponential representation and using the geometric series we can show that ) ) α +cos(α)+cos(α)+cos(3α)+ +cos(nα) = sin(( n+ sin ( ). (0) x Thus using this we can show that yi T y j is given by sin (( ) k + (xi x j ) ) sin ( ), x as we were to show. 75

76 Feature Selection Notes on the text Notes on Data Normalization First recall that for small y we have e y y + y +, thus ˆx ik = Next recall that for small v we have y + y 4 ( ) +e y + y + = y y + y which is a linear function of y as claimed. = v k=0 vk thus we get ( ) ( ) y y y y = + y y 4 y 4 y3 4 + = + y.. Notes on the Unknown Variance Case Consider the expectation E[(x i µ+µ x) ] = E[(x i µ) +(x i µ)(µ x)+(µ x) ] We can evaluate the inner expectation using ( E[(x i µ)(µ x)] = E[(x i µ) N = N = σ +E[(x i µ)(µ x)]+ σ. µ N i = i = x i E[(x i µ)(x i µ)] = N E[(x i µ) ], i = since by independence E[(x i µ)(x j µ)] = 0 if i j. Since E[(x i µ) ] = σ we get E[ˆσ ] = N = N N ( σ ( N ( σ N ) σ = σ. ) ) )+ σ N ] 76

77 Notes on Example 5.3 SeetheRscriptchap 5 example 5 3.Rforcodethatduplicatestheresultsfromthisexample. Notes on the derivation of the divergence for Gaussian distributions When the conditional densities are Gaussian we have p(x ω i ) N(µ i,σ i ) p(x ω j ) N(µ j,σ j ). Then to compute the divergence d ij given by ( ) p(x ωi ) d ij = (p(x ω i ) p(x ω j ))ln dx, () p(x ω j ) ( ) we first need to compute the log term ln p(x ωi ) p(x ω j, where we find ) ln ( ) p(x ωi ) = [ (x µi ) T Σ i (x µ i ) (x µ j ) T Σ j (x µ j ) ] + ( ) p(x ω j ) ln Σj. Σ i When we expand the quadratics above we get ( ) p(x ωi ) ln = p(x ω j ) [ x T Σ i x x T Σ j x µ T i Σ i x+µ T j Σ j x ] [ µ T i Σ i µ i µ T j Σ j µ j ] + ln ( Σj Σ i Only the first four terms depend on x while the remaining terms are independent of x and can be represented by a constant C. Because the densities p(x ) are normalized we note that (p(x ω i ) p(x ω j ))Cdx = C( ) = 0, and these terms do not affect the divergence. Thus we only need to worry about how to integrate the first four terms. To do these lets first consider the integral of these terms against p(x ω i ) (integrating against p(x ω j ) will be similar). To do these integral we will use Equation 94 from Appendix A to evaluate the integral of the terms like x T Σ x, against p(x ω i ). When we do that we find the integral of the log ratio term expressed above is given by (multiplied by /) ( ) p(x ωi ) ln p(x ω i )dx = µ T i Σ i µ i +trace(σ i Σ i ) µ T i Σ j µ i trace(σ i Σ j ) p(x ω j ) ). µ T i Σ i µ i +µ T j Σ j µ i = µ T i Σ i µ i µ T i Σ j µ i +µ T j Σ j µ i + trace(i) trace(σ i Σ j ). 77

78 In the same way the integral of the log ratio term against p(x ω j ) is given by ( ) p(x ωi ) ln p(x ω j )dx = µ T j p(x ω j ) Σ i µ j trace(σ j Σ i )+µ T j Σ j µ j +trace(σ j Σ j ) + µ T i Σ i µ j µ T j Σ j µ j = µ T j Σ j µ j µ T j Σ i µ j +µ T i Σ i µ j + trace(i) trace(σ j Σ i ). If we take of the first and second expression and add them together we get two types of terms. Terms involving the trace operation and terms that don t depend on the trace. The trace terms add to give trace terms = trace(i)+trace(σ i Σ j ) trace(i)+trace(σ j Σ i ) = trace(i)+trace(σ i Σ j )+trace(σ j Σ i ). The non-trace terms add together to give non-trace terms = µ T i Σ i µ i +µ T i Σ j µ i µ T j Σ j µ i + µ T j Σ j µ j +µ T j Σ i µ j µ T i Σ i µ j = µ T i (Σ i µ i +Σ j µ i Σ j µ j Σ i µ j )+µ T j (Σ i µ j +Σ j µ j ) = µ T i ((Σ i +Σ j )µ i (Σ i +Σ j )µ j )+µ T j (Σ i +Σ j )µ j = µ T i (Σ i +Σ j )(µ i µ j ) µ T i (Σ i +Σ j )µ j +µ T j (Σ i +Σ j )µ j = µ T i (Σ i +Σ j )(µ i µ j ) (µ T i µ T j )(Σ i +Σ j )µ j = (µ i µ j ) T (Σ i +Σ j )µ i (µ i µ j ) T (Σ i +Σ j )µ j = (µ i µ j )(Σ i +Σ j )(µ i µ j ). In total when we divide by and add together the trace and the non-trace expressions we get d ij = (µ i µ j )(Σ i +Σ j )(µ i µ j )+ trace(σ iσ j +Σ j Σ i I), () for the expression for the divergence between two Gaussians. Notes on Example 5.4 From the expression for B derived in the book B = l log ( σ +σ σ σ ), if we consider r = σ σ and put in σ = rσ in to the above we get B = l ( ) +r log +, r as r 0. Thus P e P(ω )P(ω )e B 0. 78

79 Notes on scatter matrices Recall the definitions of the within-class scatter matrix, S w, given by S w = M P i E[(x µ i )(x µ i ) T x ω i ], (3) and the definition of the between-class scatter matrix, S b, given by S b = M P i (µ i µ 0 )(µ i µ 0 ) T, (4) The variable µ 0 is the global mean vector or mean over all features independent of class. We can show that this is equivalent to the expression given in the book as follows µ 0 = N x i = N M n i x j = N j= M n i µ i = M P i µ i, (5) Where in the second sum above we mean to sum only over those features x that are members of class i. Here M is the number of classes. Note that means either µ i (the class specific means) and µ 0 (the global mean) are linear functions of the raw features x. Thus if we consider a linear transformation of x such as y = A T x, then x means denoted by µ x transform into y means in the expected manner µ y = A T µ x. From the definitions of the x based scatter matrices S wx and S bx given above and how the x based µ s change under a linear transformation we see that S yw = A T S xw A and S yb = A T S xb A. (6) We will use these two results when we pick the transformation A T so that the transformed vectors y are optimal in some way. If we consider a single feature (a scalar) in the two class case where was assume with equal class probabilities we have for S w and S b the following expressions S w = (S +S ) = (σ +σ ) S b = ((µ µ 0 ) +(µ µ 0 ) ). Since µ 0 = (µ +µ ) we find the differences needed in the expression for S b given by µ µ 0 = (µ µ ) and µ µ 0 = (µ µ ), 79

80 and we have S b = (µ µ ). Thus J = S b S w (µ µ ) (σ +σ ). This later expression is known as Fisher s discriminant ration or FDR. The book give a multidimensional generalization of FDR md = M M j i (µ i µ j ) σ i +σ j, but I would think that one would want to incorporate the class priors as was done in the multiclass generalization of the divergence via FDR md = M M j i (µ i µ j ) P(ω σi i )P(ω j ). +σ j Notes on sequential backwards selection (SBS) Here we derive the number of times we evaluate the class separability metric when using sequential backwards selection to find a suboptimal collection of l features. We start with the initial m features, and begin by evaluating the class separability measure J( ) using the full m dimensional feature vector. This results in evaluation of J. We then sequentially remove each of the m features from the full feature set and evaluate J on using each reduced vector. This requires m evaluations of J. We select the set of features of size m that gives the largest value of J. Using this selection of variables we continue this process by sequentially removing each variable to obtain a set of m vectors each of dimension m and evaluate J on each. This requires m evaluations. Thus now we have performed +m+m, evaluations of J to select the vector of size m. If we continue this pattern one more step we will do +m+m +m, evaluations of J to select the optimal set of features of size m 3. Generalizing this we need +m+m +m + +l +, evaluations of J to select the optimal set of features of size l. This can be simplified as + m k=l+ k = + m k k= l k k= = + m(m+) l(l +). A simple python implementation of this search procedure is given in backwards selection.py and backwards selection run best subsets.py. 80

81 Notes on sequential forwards selection (SFS) In the same way as in sequential backwards selection we start by performing m evaluations of J to pick the best single feature. After this feature is selected we then need to evaluate J, m times to pick the set of two features that is best. After the best two features are picked we need to evaluate J m time to pick the best set of three features. This process continues until we have selected l features. Thus we have in total m k=m (l ) k = m k k= m l k= = m(m+) (m l)(m l+) = lm l(l ), when we simplify. A simple python implementation of this search procedure is given in forward selection.py and forward selection run best subsets.py. Notes on optimal feature generation For J 3 defined as trace(s w S m ) when we perform a linear transformation of the raw input feature vectors x as y = A T x, the scatter matrices transform as given by Equation 6 or S yw = A T S xw A and S ym = A T S xm A the objective J 3 as a function of A becomes J 3 (A) = trace((a T S xw A) (A T S xm A)). (7) We would like to pick the value of A such that when we map the input features y under A T the value of J 3 (A) is maximal. Taking the derivative of the above and using the results on Page 96 we get that the equation J 3(A) A = 0 imply (when we postmultiply by AT S xw A) S xw A(A T S xw A) (A T S xb A) = S xb A. But because the transformed scatter matrices S yw and S yb are given by A T S xw A and A T S xb A respectively by using these expressions and premultipling the above by Sxw, we can write the above expression as ASyw S yb = Sxw S xba. (8) Note that this expression has scatter matrices in terms of y on the left and x on the right. Whenwritten inthisform, thisexpression hides theamatrices inthedefinition ofs yw and S yb. Since we don t know A we can t directly compute the matrices S yw and S yb. Assuming for the moment that we could compute these two matrices, since they are both symmetric we can find an invertible matrix B such that diagonalizes both S yw and S yb. This means that there is an invertible linear transformation B such that B T S yw B = I and B T S yb B = D, 8

82 where D is a diagonal matrix. This means that in terms of B and D we can write Syw and S yb as Syw = (B T B ) = BB T and S yb = B T DB. If we use these expressions after we postmultiply Equation 8 by B we find (S xw S xb)ab = AS yw S ybb = ABB T B T DB B = ABD. If we let C AB we have Sxw S xbc = CD. (9) This is an eigenvalue problem where columns of C are the eigenvectors of Sxw S xb and D is a diagonal matrix with the eigenvectors on the diagonal. To complete this discussion we now need to decide which of the eigenvectors of SxwS xb we are going to select as the columns of C. In an M class problem the rank of S xb is at most M. Thus the rank of the product matrix Sxw S xb is at most M. Thus we can have at most M non-zero eigenvalues and thus there can be at most M associated eigenvectors. Question: These eigenvalues are positive, but I currently don t see an argument why that needs to be so. If anyone knows of one please let me know. Since we are asked to take the m original features from the vector x and optimally (with respect to J 3 ) linearly transform them into a smaller set l features the largest l can be is M. If l = M then we should take C to have columns represented by all of the non-zero eigenvectors of SxwS xb. This will have the same maximal value for J 3 in that in the original space of x J 3 has the value J 3,x = trace(s xws xb ) = λ +λ + +λ M, since a matrix trace is equivalent to the sum of that matrices eigenvalues. While after performing the C T transformation on x or ŷ = C T x we have J 3 given by Equation 7 or J 3,ŷ = trace((c T S xw C) (C T S xb C)). To evaluate this expression recall that C is the matrix in Equation 9 or S xb C = S xw CD. If we premultiply this by C T we get so Thus C T S xb C = C T S xw CD, (C T S xw C) (C T S xb C) = D. trace((c T S xw C) (C T S xb C)) = trace(d) = λ +λ + +λ M, the same as J 3,x obtained earlier. If l < M then we take the l eigenvectors associated with the l largest eigenvalues of S xw S xb. Note that B is not necessarily orthogonal, all we know is that it is invertible. 8

83 Problem Solutions Problem 5. ( (N )s z σ is a chi-squared random variable) To solve this problem we will use several results from Appendix A. (chi-squared distribution). To begin recall that when given N draws, {x i } N, from a Gaussian random variable with variance σ and sample mean x the expression σ (x i x), is given by a chi-squared distribution with N degrees of freedom. Next recall that if χ and χ are independent random variables from chi-squared distributions with N and N degrees of freedom then χ = χ +χ, is a random variable from a chi-squared distribution with N +N degrees of freedom. Thus when we consider the expression (N )s z or σ (x i x) σ + (y i ȳ) σ, we have the sum of two independent chi-squared random variables each of degree N. Thus this expression is another chi-squared random variable with N degrees of freedom. Problem 5. (q has a t-distribution) Using the same arguments as in problem 5. above we first note that (N +N )s z σ, is given by a χ random variable with N +N degrees of freedom. Next, if we consider the random variable x ȳ µ +µ we know that it is Gaussian with a zero mean and a variance given by Var[ x ȳ µ +µ ] = Var[(( x µ ) (ȳ µ ))] = Var[ x µ ] Cov[( x µ ),(ȳ µ )]+Var[ȳ µ ] = Var[ x µ ]+Var[ȳ µ ] = σ N + σ N, since we are assuming that Cov[( x µ ),(ȳ µ )] = 0. Thus the random variable x ȳ µ +µ, σ N + N 83

84 is a standard normal. Then using the results from Appendix A. (t-distribution) where it is stated that the ratio of a standard normal random variable over a scaled χ random variable with n degrees of freedom is a t-distributed random variable of degree n we have, forming a ratio of the required form, that x ȳ µ +µ σ N + N ( ), (N +N )s z σ N +N is a t distributed random variable with N +N degrees of freedom. The above expression simplifies to x ȳ µ +µ s z. N + N Which shows that the desired expression is a t distributed random variable with N +N degrees of freedom. Problem 5.3 (A is orthonormal) The given matrix A has components A(i,j) that can be represented as A(,j) = n j n A(i,i) = i i(i ) i A(i,j) = i and j i. i(i ) Then the (p,q) element of the product AA T is given by (AA T )(p,q) = n A(p,k)A T (k,q) = k= n A(p,k)A(q,k). k= We will evaluate this expression for all possible values of (p,q) and show that in all cases this matrix product equals the identity matrix. Since the first row above seems different than the general case we start there. If p = then we have If we then take q = we get (AA T )(,q) = n A(q,k). n k= (AA T )(,) = n A(,k) = n n k= n =. k= 84

85 If q > then we get ] (AA T )(,q) = n A(q,k) = q [A(q,q)+ A(q, k) n n k= k= k= [ ] = q q = 0. n q(q ) q(q ) Now assume that p >. Then we have for (AA T )(p,q) the following (AA T )(p,q) = = p n A(p,k)A(q,k) = A(p,p)A(q,p)+ A(p, k)a(q, k) k= p p A(q,p) A(q,k). (0) p(p ) p(p ) k= k= To evaluate this lets first assume that q < p then A(q,k) = 0 if k > q and then Equation 0 gives (AA T q )(p,q) = 0 A(q, k) p(p ) k= = [A(q,q)+ p(p ) ] q A(q, k) k= [ q = p(p ) q(q ) If q > p then Equation 0 gives ( p ) p(p ) q(q ) q k= p p(p ) Finally, if p = q then Equation 0 gives ( (AA T p )(p,p) = ) p(p ) q(q ) = k= q(q ) ] = 0. q(q ) = 0. p p(p ) k= p(p ) q(q ) [(p )(q )+(p )] q(q ) = (p )q p(p ) q(q ) =, when we convert all q s into p s. Thus we have shown that AA T = I and A is an orthogonal matrix. 85

86 Problem 5.4 (linear combinations of Gaussian random variables) Recall [] that the characteristic function for multidimensional Gaussian random vector x with mean µ and covariance Σ is given by ( ζ X (t) = E[e ittx ] = exp it T µ ) tt Σt. () If our random vector y is a linear combination of the elements of the vector x then y = Ax and the characteristic function for y is given by ζ Y (t) = E[e itty ] = E[e ittax ] = E[e i(at t) Tx ] = ζ X (A T t) ( = exp it T Aµ ) tt AΣA T t, which is the same as the characteristic function of a multidimensional Gaussian random vector that has a mean vector of Aµ and covariance matrix of AΣA T as we were to show. If x i aremutuallyindependentwithidenticalvariancessayσ thenσisamultipleoftheidentity matrix, say Σ = σ I. In that case AΣA T = σ AA T. In that case if A is orthogonal the covariancematrixfory i isσ I andthesetransformedvariablesarealsomutuallyindependent. Problem 5.5 (the ambiguity function) We define the ambiguity function as A = M K P( j )P(ω i j )log M (P(ω i j )). () j= If the distribution of features over each class is completely overlapping, then P( j ω i ) is independent of ω i. That is P( j ω i ) = P( j ). In this case, then using Bayes rule we have P(ω i j ) = P( j ω i )P(ω i ) P( j ) = P(ω i ). The ambiguity function in this case then becomes A = M K M P( j )P(ω i )log M (P(ω i )) = P(ω i )log M (P(ω i )). j= If we further assume that each class is equally likely then P(ω i ) =, so log M M and we find A becomes A = M =. M ( M ) = If the distribution of features are perfectly separated, then P(ω i j ) = 0 if class i does not have any overlap with the region j, otherwise P(ω i j ) =, since in that case only class 86

87 ω i is present. To evaluate A we break the inner sum of j into regions where class i has feature overlap and does not have M A = + j: class i overlaps j j: class i does not overlap j In the first sum, since P(ω i j ) = each term is zero and the entire sum vanishes. In the second sum, when P(ω i j ) = 0 by a limiting argument one can show that P(ω i j )log M (P(ω i j )) = 0, and thus the entire sum also vanishes. Thus we have shown that A = 0. Problem 5.7 (the divergence increase for Gaussian densities) To begin this problem, we are told that when we consider the generation of original feature vectors x of dimension m, that the two classes i and j have the same covariance matrix Σ. If we then add an [ additional ] feature x m+ so that we desire to consider the covariances x of vectors defined as, we will assume that these larger vectors also have equal x m+ covariance matrices when considered from class i and j. In this case that covariance matrix will be take of the form [ ] Σ r ˆΣ = r T σ. When two classes have equal covariances the trace terms in the divergence d ij given by Equation vanish and d ij simplifies to d ij = (ˆµ i ˆµ j )ˆΣ (ˆµ i ˆµ j ). (3) Here ˆµ i and ˆµ j are the mean vectors for the larger vector with the scalar x m+ appended to the original x, for example [ ] µi ˆµ i =, where µ i is the mean of x m+ under class i. The same notation will be used for class j. Thus to find a recursive relationship for d ij we need a way of decomposing the inner product defined above. Since we are to assume that the covariance matrix, ˆΣ, for the vector block form as ˆΣ µ i [ Σ r r T σ ]. [ x x m+ ] is given in Where Σ is a m m matrix and r is a m column vector. Thus to further simplify the divergence we need to derive an expression for ˆΣ. To compute this inverse we will multiply ˆΣ on the left by a block matrix with some variable entries which we hope we can find suitable 87

88 values for and thus derive [ the block ] inverse. As an example of this lets multiply ˆΣ on the Σ 0 left by the block matrix b, where b is a m dimensional vector and d is a scalar. d Currently, the values of these two variables are unknown. When we multiply by this matrix we desire to find b and d such that [ ][ ] [ ] Σ 0 Σ r I 0 b d r T σ =. (4) 0 Equating the block multiplication result on the left to the components of the block matrix on the right gives b Σ+dr T = 0. for the (,) component. This later equation can be solved for b by taking transposes and inverting Σ as b = Σ rd. If we take d = and b given by the solution above, the product on the left-hand-side given by Equation 4 does not becomes the identity but is given by [ ][ ] [ ] Σ 0 Σ r I Σ r T Σ r T σ = r 0 σ r T Σ. (5) r Note what we have just done is the forward solve step in Gaussian elimination. Taking the inverse of both sides of this later equation we find [ Σ r r T σ ] [ Σ 0 r T Σ α ] = [ I Σ r 0 σ r T Σ r ]. or [ ] [ ] Σ r I Σ r T σ = [ ] r Σ 0 0 σ r T Σ r r T Σ. [ ] I Σ Thus it remains to find the inverse of the block matrix r 0 σ r T Σ. This inverse is r the well known backwards solve in Gaussian elimination. Note that this inverse is given by [ ] I Σ [ ] r I ασ = r, 0 0 α where we have made the definition of the scalar α such that α σ r T Σ r. Using this result we have that [ ] [ ][ ] Σ r I ασ r T σ = r Σ 0 0 α r T Σ [ ] Σ = +ασ rr T Σ ασ r αr T Σ. (6) α Using this expression one of the required product in the evaluation of Equation 3 is given 88

89 by [ Σ r r T σ ] [ µi µ j µ i µ j ] = = = [ ][ ] Σ +ασ rr T Σ ασ r µi µ j αr T Σ α µ i µ j [ (Σ +ασ rr T Σ )(µ i µ j ) ασ r(µ i µ j ) αr T Σ (µ i µ j )+α(µ i µ j ) [ ] d+ασ rr T d ασ r(µ i µ j ) αr T. d+α(µ i µ j ) ] Where since the product Σ (µ i µ j ) appears a great number of times we defined it to be d, so d Σ (µ i µ j ). Computing the product needed to produce the full quadratic term in d ij we get ( µ T i µ T j,µ i µ j ) [ d+ασ rr T d ασ r(µ i µ j ) αr T d+α(µ i µ j ) ] = (µ i µ j ) T d + α(µ i µ j ) T Σ rr T d α(µ i µ j ) T Σ r(µ i µ j ) α(µ i µ j )r T d + α(µ i µ j ). Taking the transpose of either term we see that the third and fourth scalar products in the above expressions are equal. Combining these we get (µ i µ j ) T d+αd T rr T d+α(µ i µ j ) αd T r(µ i µ j ). Completing the square of the expression with respect to µ i µ j we have this expression given by α [ (µ i µ j ) d T r ] α(d T r) +αd T rr T d+(µ i µ j ) T d = α [ (µ i µ j ) d T r ] +(µi µ j ) T d. Thus using this and the definition of d and α we see that d ij is given by [ d ij (x,x,,x m,x m+ ) = (µ i µ j ) T Σ µi µ j d T r ] (µ i µ j )+ σ r T Σ r [ µi µ j (µ i µ j ) T Σ r ] = d ij (x,x,,x m )+. σ r T Σ r If the new feature is uncorrelated with the original ones then we have the vector r equal zero and the second expression follows from this one. Problem 5.8 (the divergence sums for statistically independent features) Consider the divergence d ij defined by ( ) p(x ωi ) d ij = (p(x ω i ) p(x ω j ))ln dx. (7) p(x ω j ) x 89

90 Then if the features are statistically independent in each class we have p(x ω i ) = l p(x k ω i ). Thus the logarithmic term above becomes ( ) ( l ) p(x ωi ) p(x k ω i ) ln dx = ln p(x ω j ) p(x k ω j ) k= l ( ) p(xk ω i ) = ln. p(x k ω j ) Then we get for d ij is d ij = k= x k= k= l ( ) p(xk ω i ) (p(x ω i ) p(x ω j ))ln dx. p(x k ω j ) Since the logarithmic term only depends on x k (and not the other k s) we can integrate out them by performing the x integration for all variables but x k. This then gives d ij = l k= x k (p(x k ω i ) p(x k ω j ))ln ( ) p(xk ω i ) dx k, p(x k ω j ) which is the sum of l scalar divergences each one over a different variable. Problem 5.9 (deriving the Chernoff bound) The books equation 5.7 is given by P e P(ω ) s P(ω ) s p(x ω ) s p(x ω ) s dx for 0 s. (8) When the densities p(x ω i ) for i =, are d-dimensional multidimensional Gaussians then { p(x ω i ) = exp } (π) d/ Σ i / (x µ i) T Σ i (x µ i ), (9) so the product in the integrand in Equation 8 is given by p(x ω ) s p(x ω ) s = (π) ds (π) d( s) Σ Σ s s { exp s (x µ ) T Σ (x µ ) ( s) } (x µ ) T Σ (x µ ). Expanding the terms in the exponential we find (ignoring for now the factor ) sx T Σ x sxt Σ µ +sµ T Σ µ +( s)x T Σ x ( s)xt Σ µ +( s)µ Σ µ. 90

91 Grouping the quadratic, linear, and constant terms we find x T (sσ +( s)σ )x xt (sσ µ +( s)σ µ )+sµ T Σ µ +( s)µ T Σ µ. Using this expression the product we are considering then becomes p(x ω ) s p(x ω ) s = exp (π) d Σ Σ s s { exp { ( sµ T Σ µ +( s)µ T Σ µ ) } (30) ( x T (sσ +( s)σ )x xt (sσ µ +( s)σ µ ) )}. Thus we want to integrate this expression over all possible x values. The trick to evaluating an integral like this is to convert it into an integral that we know how to integrate. Since this involves the integral of a Gaussian like kernel we might be able to evaluate this integral by converting exactly it into the integral of a Gaussian. Then since it is known that the integral over all space of a Gaussians is one we may have evaluated indirectly the integral we are interested in. To begin this process we first consider what the argument of the exponential (without the /) of a Gaussian with mean θ and covariance A would look like (x θ) T A (x θ) = x T A x x T A θ+θ T A θ. (3) Using this expression to match the arguments of the quadratic and linear terms in the exponent in Equation 30 would indicate that A = sσ +( s)σ and A θ = sσ µ +( s)σ µ. Thus the Gaussian with a mean value θ and covariance A given by A = (sσ +( s)σ ) (3) θ = A(sΣ µ +( s)σ µ ) = (sσ +( s)σ ) (sσ µ +( s)σ µ ), (33) wouldevaluatetohavingexactlythesameexponentialterms(modulotheexpressionθ T A θ). The point of this is that with the definitions of A and θ we can write x T (sσ +( s)σ )x x T (sσ µ +( s)σ µ ) = (x θ) T A (x θ) θ T A θ, so that the integral we are attempting to evaluate can be written as p(x ω ) s p(x ω ) s dx = (π) Σ d Σ s s { exp exp ( sµ T Σ µ +( s)µ T Σ { } (x θ)t A (x θ) dx. ) } { µ exp } θt A θ In effect what we are doing is completing the square of the argument in the exponential. Since we know that multidimensional Gaussians integrate to one, this final integral becomes { exp } (x θ)t A (x θ) dx = (π) d/ A /. (34) 9

92 In addition, the argument in the exponential in front of the (now evaluated) integral is given by sµ T Σ µ +( s)µ T Σ µ θ T A θ. (35) When we put in the definition of A and θ given by Equations 3 and 33 we have that θ T A θ is equivalent to three (somewhat complicated) terms θ T A θ = (sµ T Σ +( s)µ T Σ )(sσ +( s)σ ) (sσ µ +( s)σ µ ) = s µ T Σ (sσ +( s)σ ) Σ µ + s( s)µ T Σ (sσ +( s)σ ) Σ µ = ( s) µ T Σ (sσ +( s)σ ) Σ µ. Given that we still have to add the terms sµ T Σ µ +( s)µ T Σ µ to the negative of this expression we now stop and look at what our end result should look like in hopes of helping motivate the transformations to take next. Since we might want to try and factor this into an expression like (µ µ ) T B(µ µ ) by expanding this we see that we should try to get the expression above into a three term form that looks like µ T Bµ µ T Bµ +µ T Bµ, (36) for some matrix B. Thus lets add sµ T Σ µ + ( s)µ T Σ µ to the negative of θ T A θ and write the result in the three term form suggested by Equation 36 above. We find that Equation 35 then becomes when factored in this way [ ] sµ T Σ sσ (sσ +( s)σ ) Σ µ (37) [ ] s( s)µ T Σ (sσ +( s)σ ) Σ µ (38) [ ] + ( s)µ T Σ ( s)σ (sσ +( s)σ ) Σ µ. (39) We now use the inverse of inverse matrix sums lemma (IIMSL) given by (A +B ) = A(A+B) B = B(A+B) A, (40) to write the matrix products in the middle term of the above expression as Σ (sσ +( s)σ ) Σ = (( s)σ +sσ ). (4) Recognizing this matrix as one that looks familiar and that we would like to turn the others into lets now hope that the others can be transformed into a form that looks like that. To further see if this is possible, and to motivate the transformations done next, consider how the desired expression would look like expanded as in Equation 36. We have without the factor of s( s) the following (µ µ ) T (( s)σ +sσ ) (µ µ ) = µ T (( s)σ +sσ ) µ (4) µ T (( s)σ +sσ ) µ (43) + µ T (( s)σ +sσ ) µ. (44) Since as just shown the middle terms match as desired, looking at the terms Equation 37 and 4, to have the desired equality we want to show if we can prove s [ ] Σ sσ (sσ +( s)σ ) Σ = s( s)(( s)σ +sσ ), (45) 9

93 and ( s) [ ] Σ ( s)σ (sσ +( s)σ ) Σ = s( s)(( s)σ +sσ ), (46) the similar expression for the terms Equation 39 and 44. To show that in fact this matrix difference is correct we will use another matrix identity lemma. This time we will use the Woodbury identity which can be written as (A+UCV) = A A U(C +VA U) VA. (47) If we specialize this identity by taking C and V to both be identity matrices we obtain (A+U) = A A U(I +A U) A = A A (U +A ) A. Using this last expression with A = sσ and U = ( s)σ we can derive that (sσ +( s)σ ) = s Σ ( s Σ s Σ + ) s Σ s Σ = s Σ ( s) Σ (( s)σ +sσ ) Σ. s Multiplying this last expression by sσ on the left and Σ on the right to get sσ (sσ +( s)σ ) Σ = Σ ( s)(( s)σ +sσ ). This last expression gives that Σ sσ (sσ +( s)σ ) Σ = ( s)(( s)σ +sσ ), which is equivalent to the desired Equation 45. Using exactly the same steps one can prove Equation 46. In summary then we have shown that p(x ω ) s p(x ω ) s A / dx = Σ Σ s s { exp } s( s)(µ µ ) T (( s)σ +sσ ) (µ µ ). It remains to evaluate the coefficient A / Σ Σ s s. Taking the determinant of both sides of Equation 4 and solving for the expression A defined in Equation 3 we find A = Σ Σ ( s)σ +sσ. (48) When we put this into what we have found for p(x ω ) s p(x ω ) s dx we obtain p(x ω ) s p(x ω ) s Σ s Σ s dx = ( s)σ +sσ { exp } s( s)(µ µ ) T (( s)σ +sσ ) (µ µ ). 93

94 If we define the above expression equal to e b(s) we see that b(s) is given by b(s) = s( s)(µ µ ) T (( s)σ +sσ ) (µ µ ) + { } ln ( s)σ +sσ. (49) Σ s Σ s When this is combined with Equation 8 we have finally proved the Chernoff inequality. If we now consider the case when Σ = Σ = Σ we have b(s) = s( s) (µ µ ) T Σ (µ µ ). Then as (µ µ ) T Σ (µ µ ) is a scalar multiplier of the function s( s), its value does not change the location of the extrema of b(s). To find the extrema of b(s) we take the first derivative, set the result equal to zero and solve for s. We find b (s) = s s = 0 s =. the second derivative of the function b(s) is given by b (s) =. Since this is negative s = is a maximum of b(s) or a minimum of e b(s). Problem 5.0 (the mixture scatter matrix is the sum of S w and S b ) Consider evaluating the expectation in the definition of S m by conditioning on each class S m = E[(x µ 0 )(x µ 0 ) T ] = M E[(x µ 0 )(x µ 0 ) T x ω i ]P i. where P i = P(x ω i ). Then write x µ 0 = x µ i +µ i µ 0 and expand the inner product above as (x µ 0 )(x µ 0 ) T = (x µ i )(x µ i ) T +(x µ i )(µ i µ 0 ) T +(µ i µ 0 )(µ i µ 0 ) T. Then taking the conditional expectation of the above expression with respect to ω i since E[x µ i x ω i ] = 0 the middle term in above vanishes. The last term does not depend on x and is therefore a constant with respect to the expectation and we get for S m S m = M P i E[(x µ i )(x µ i ) T x ω i ]+ which when we recall the definitions of S w and S b given by S w = S b = M P i (µ i µ 0 )(µ i µ 0 ) T, M P i E[(x µ i )(x µ i ) T x ω i ] (50) M P i (µ i µ 0 )(µ i µ 0 ) T, (5) we recognize as expressing S m as S m = S w +S b. 94

95 Problem 5. (bounds on the cross-correlation coefficient) When we take the vectors x and y as x i x i x =. and y = y i y i., x Ni y Ni the Schwartz s inequality x T y x y show that ρ ij where ρ ij is defined by ρ ij = N n= x niy nj N n= x. N ni n= y nj Problem 5. (the divergence of a two class problem) The divergence between two Gaussians is given by Equation. If we assume that the Gaussians have the same covariance then Σ = Σ = Σ and the divergence becomes d = (µ µ ) T Σ (µ µ ) = trace((µ µ ) T Σ (µ µ )) = trace(σ (µ µ )(µ µ ) T ). When the classes are equiprobable P = P =. Then the within class scatter matrix Equation 3 becomes S w = Σ + Σ = Σ. Now lets compute S b using Equation 4. We have S b = M P i (µ i µ 0 )(µ i µ 0 ) T = [ (µ µ 0 )(µ µ 0 ) T +(µ µ 0 )(µ µ 0 ) T]. Since µ 0 = M P iµ i = (µ +µ ) when we compute the needed differences to compute S b we calculate S b = [ 4 (µ µ )(µ µ ) T + ] 4 (µ µ )(µ µ ) T = 4 (µ µ )(µ µ ) T. Thus if we consider the expression trace(s w S b ) we see that it equals in this case trace(s w S b) = 4 trace(σ (µ µ )(µ µ ) T ). We see that this is proportional to the expression d derived above. 95

96 Problem 5.3 (the number of combinations in backwards elimination) See the notes on Page 80 where we derive this expression. Problem 5.4 (the derivative of a trace) We want to evaluate A trace{(at S A) (A T S A)} Thealgebraicprocedureforcomputing derivatives like trace{f(a)}wheref( )isamatrix A function of a matrix argument is discussed in [3]. The basic procedure is the following. We consider the matrix derivative as several scalar derivative (one derivative for each component a kl of A). We pass the derivative of a kl though the trace operation and take the scalar F(A) derivative of various matrix expressions i.e. a kl. Taking these derivatives is easier if we introduce the matrix V(k,l) which is a matrix of all zeros except for a single one at the location (k,l). This is a helpful matrix to have since a kl A = V(k,l). Once we have computed the derivative of the argument of the trace F(A) with respect to a kl we need to write it in the form F(A) a kl = i g i (A)V(k,l)h i (A). We can then take the trace of the above expression and use the permutability of matrices in the argument of the trace to write { } { } F(A) trace = trace g i (A)V(k,l)h i (A) a kl i = trace{g i (A)V(k,l)h i (A)} i = i trace{h i (A)g i (A)V(k,l)}. (5) Finally we use the property of the trace to conclude that for any n n matrix M m k m k MV(k,l) = 0. 0, m n,k m nk or a matrix with the kth column of M in the lth column. Since the only nonzero column is the lth, to take the trace of this matrix, we need to find what the element of the lth row in 96

97 that column is. From the above we see that this element is m lk. Thus we have just argued that trace{mv(k,l)} = M(l,k). When we reassemble all elements, from this result, to compute the full matrix derivative of trace{ma} we see that A trace{ma} = MT. Back to Equation 5 we can use the above to get the full matrix derivative A trace{f(a)} = i For this problem we now implement this procedure. (h i (A)g i (A)) T. (53) To begin we evaluate the a kl derivative of (A T S A) (A T S A). From the product rule we have [ (A T S A) (A T S A) ] [ ] [ ] = (A T S A) (A T S A)+(A T S A) (A T S A). a kl a kl a kl To evaluate the a kl derivative of (A T S A) recall that if F(A) = G (A) then Thus we get (A T S A) F(A) a kl = G (A) G(A) a kl G (A). (54) a kl = (A T S A) (AT S A) a kl (A T S A). Thus we need to evaluate the derivative of A T S A (a similar needed derivative is of A T S A). We get (A T S A) = V T (k,l)s A+A T S V(k,l). a kl Combining these results we get a kl (A T S A) (A T S A) = (A T S A) [ V T (k,l)s A+A T S V(k,l) ] (A T S A) (A T S A) +(A T S A) [ V T (k,l)s A+A T S V(k,l) ]. Then for each term (there are four of them) once we take the trace we can write each one as g i (A)V(k,l)h i (A) for functions g i ( ) and h i ( ) for i =,,3,4 by using trace(a T ) = trace(a), if needed. We will need to use that identity for the first and third terms. We get g (A) = (A T S A)(A T S A) A T S, and h (A) = (A T S A) g (A) = (A T S A) A T S, and h (A) = (A T S A) (A T S A) g 3 (A) = A T S, and h 3 (A) = (A T S A) g 4 (A) = (A T S A) A T S, and h 4 (A) = I. 97

98 Once we have done this we use Equation 53 (but without the transpose yet) to get ( A trace{ (A T S A) (A T S A) } ) T = (A T S A) (A T S A)(A T S A) A T S (A T S A) (A T S A)(A T S A) A T S +(A T S A) A T S +(A T S A) A T S. Thus taking the transpose of both sides we finally find A trace{ (A T S A) (A T S A) } = S A(A T S A) (A T S A)(A T S A) +S A(A T S A), as we were to show. Problem 5.7 (the eigenstructure for S w S b in a two class problem) In a two class problem M =, P +P =, and we have S b given by S b = M P i (µ i µ 0 )(µ i µ 0 ) T = P (µ µ 0 )(µ µ 0 ) T +P (µ µ 0 )(µ µ 0 ) T. Since µ 0 = M P iµ i = P µ +P µ we have that µ µ 0 = µ P µ P µ = ( P )µ P µ = P (µ µ ) µ µ 0 = P µ +( P )µ = P (µ µ ). Using these we see that S b is given by Thus the matrix S w S b is S b = P P (µ µ )(µ µ ) T +P P (µ µ )(µ µ ) T = P P (µ µ )(µ µ ) T. P P S w (µ µ )(µ µ ) T. Since the matrix (µ µ )(µ µ ) T is rank one the matrix Sw S b is rank one, and thus we have one non-zero eigenvalue (and its corresponding eigenvector). Consider the vector v Sw (µ µ ) and observe that Sw S bv = P P Sw (µ µ )(µ µ ) T Sw (µ µ ) = (P P (µ µ ) T Sw (µ µ ))Sw (µ µ ) = λ v, where we take λ = P P (µ µ ) T S w (µ µ ). Thus v is an eigenvector of S w S b and λ is its corresponding eigenvalue. 98

99 Problem 5.8 (orthogonality of the eigenvectors of S S ) Since S and S can be simultaneously diagonalized, there exists and invertible matrix B such that B T S B = I and B T S B = D, where D is a diagonal matrix. Since B is invertible we can solve for S and S in terms of B and D as S = B T B and S = B T DB. Using these consider the product S S = (BB T )(B T DB ) = BDB. Let v i beaneigenvector of S S with eigenvalue λ i. Then by the definitionof aneigenvector we have S S v i = λ i v i, or from the expression for S S in terms of B and D BDB v i = λ i v i. This gives two expressions for v i v i = λ i BDB v i v i = λ i BD B v i. Now consider v T i S v j, using the first of these we will replace v i with λ i BDB v i, S with B T B, and using the second expression above v j with λ j BD B v j to get v T i S v j = λ j λ i v T i B T DB T B T B BD B v j = λ j λ i v T i B T B v j = λ j λ i v T i S v j. Thus ( λ ) j vi T λ S v j = 0. i So if i j (where we assume that λ i λ j ) then the last equation shows that v i S v j = 0 as we were to show. 99

100 Feature Generation I: Linear Transforms Notes on the text Notes on basis vectors and images We define a separable transformation of X to be one where the transform Y is given by Y = U H XV. (55) A separable transformation can be thought of as two successive transformations, one over the columns of X and another over the rows of the matrix product U H X. To see this first define the product U H X as Z and write U H as u H 0 u H. u H N so that considered as a block matrix product, the expression Z = U H X is the product of a N block matrix times a block matrix or u H 0 u H u H 0 X. X = u H X.. u H N X u H N This result has N rows where each one is of the form u H i X or the inner product transform of the N columns of X. Thus the transformation U H X is a transformation over the rows of X. Now consider the product (U H X)V, which we can write as ZV = (V H Z H ) H. Notice that V H Z H is the same type of transformation as we just discussed i.e. inner product transforms of the columns of Z H. Equivalently inner product transforms of the rows of Z = U H X, proving the statement made above. Note that some of the calculations for Example 6. are performed in the MATLAB script chap 6 example 6.m. Notes on independent component analysis (ICA) From the derivation in the book we have that at a stationary point If we postmultiply by W and use W T W = I we get J(W) W WT = E[I φ(y)y T ] = 0. (56) J(W) W = E[I φ(y)yt ]W = 0. The expression J(W) W = E[I φ(y)yt ]W is called the natural gradient. 00

101 Notes on the discrete Fourier transform We define the scalar W N as an Nth root of unity or ( W N exp j π ) N, (57) with j and the matrix W H is given by W N WN WN 3 W N N WN WN 4 WN 6 W (N ) N N W N N W (N ) N W 3(N ) N W (N )(N ) N W N N W (N ) N W 3(N ) N W (N )(N ) N From this we see that the (i,j)th component of W H is given by W N N W (N ) N W (N )(N ) N W (N )(N ) N. (58) (W H )(i,j) = N W ij N. (59) Using W H above and the fact that WN = W N the matrix W is given by W N W N W 3 N W (N ) N W N W 4 N W 6 N W (N ) N N W (N ) N W (N ) N W 3(N ) N W (N )(N ) N W (N ) N W (N ) N W 3(N ) N W (N )(N ) N From this we see that the (i,j)th component of W is given by W (N ) N W (N ) N W (N )(N ) N W (N )(N ) N. (60) W(i,j) = N W ij N. (6) These expressions will be used in the problems and derivations below. Notes on the two-dimensional Fourier transform The two-dimensional discrete Fourier transform is defined as Y(k,l) = N N N m=0 n=0 X(m,n)W km N Wln N. (6) RecallingviaEquation59thatthe(k,m)element ofw H is N W km N intermsoftheelements of W H this sum is Y(k,l) = N N X(m,n)(W H )(k,m)(w H )(l,n). m=0 n=0 0

102 Using Equation 76 to convert this double sum into a matrix product we have we have Y = W H X(W H ) T but since W H is symmetric we have Y = W H XW H. (63) Notes on the Haar transform For the Haar transform, given the index n, we have n basis functions index by k where k = 0,,, n and denoted by h k (z). Given a value of the index k in the range just specified we can convert this index k uniquely into two other nonnegative integers p and q. The integer p (for power) is the largest natural number such that p k and then q is the remainder. Thus let q be given by q = k p. Thus with these two definitions of p and q we have written the index k as k = p +q. (64) This definition works for p 0, where if p = 0 then q = 0 or. For example, if we take n = 3 then there are 8 basis functions k = 0,,,,7 and we have the mapping described above from k into (q,p) given by k = 0 p = 0 and q = 0 k = p = 0 and q = k = p = and q = + = k = 3 p = and q = 3 + = k = 4 p = and q = 4 + = k = 5 p = and q = 5 + = k = 6 p = and q = 6 + = 3 k = 7 p = and q = 7 + = 4. The reason for introducing the indexes (p,q) is that it is easier to write the expression for the basis functions h k (z) in terms of the numbers p and q. Given the above equivalence we can convert sums over k (the number of basis functions) into a double sum over p and q as n k=0 h k (z) h p=0,q=0 (z)+h p=0,q= (z)+ n p p= q= h pq (z), (65) since the range of p is 0 p n and q is between q p. Note that due to the slightly different conditions that happen when k = 0 and k = in Equation 64, we have represented these terms on their own and outside of the general summation notation. 0

103 Notes on the two-band discrete time wavelet transform (DTWT) Most of the notes in this section are verifications of the expressions given in the book. While there are no real comments with this section these notes provide more details in that they explicitly express some of the intermediate expressions which make verification of the proposed expressions easier. We begin with the two-band filter equations, when our two filters have impulse response functions given by h 0 (k) and h (k). In that case we have y 0 (k) = l y (k) = l x(l) h 0 (n l) n=k (66) x(l) h (n l) n=k. (67) When k = 0 for y 0 (k) then n = 0 and the first equation above gives y 0 (0) = l x(l)h 0 (0 l) = +x( 3)h 0 (3)+x( )h 0 ()+x( )h 0 () + x(0)h 0 (0) + x()h 0 ( )+x()h 0 ( )+x(3)h 0 ( 3)+. When k = for y 0 (k) then n = and we get y 0 () = l x(l)h 0 ( l) = +x( 3)h 0 (5)+x( )h 0 (4)+x( )h 0 (3) + x(0)h 0 () + x()h 0 ()+x()h 0 (0)+x(3)h 0 ()+. The same type of expressions will hold for y (k) but with h 0 replace with h. When we list these equations in a matrix form we get. y 0 ( ) y ( ) y 0 ( ) y ( ) y 0 (0) y (0) y 0 () y () y 0 () y (). = h 0 ( ) h 0 ( 3) h 0 ( 4) h 0 ( 5) h 0 ( 6) h ( ) h ( 3) h ( 4) h ( 5) h ( 6) h 0 (0) h 0 ( ) h 0 ( ) h 0 ( 3) h 0 ( 4) h (0) h ( ) h ( ) h ( 3) h ( 4) h 0 () h 0 () h 0 (0) h 0 ( ) h 0 ( ) h () h () h (0) h ( ) h ( ) h 0 (4) h 0 (3) h 0 () h 0 () h 0 (0) h (4) h (3) h () h () h (0) h 0 (6) h 0 (5) h 0 (4) h 0 (3) h 0 () h (6) h (5) h (4) h (3) h () x( ) x( ) x(0) x(+) x(+) I explicitly included more terms than in the book so that the pattern of the elements is as clear as possible. In practice, while we can tolerate a non causal impulse filters (h 0 (k) and 03..

104 h (k) nonzero for negative k) but since we would like the matrix above to be of finite extent we require that h 0 and h have only a finite number of nonzero terms. As a matrix equation we can write this as y = T i x, where T i is the mapping into the wavelet domain. Once we have constructed the two outputs y 0 (k) and y (k) we seek another pair of filters of a special form that act as an inverse to the above mapping, in that they can synthesis the original signal, x, from the output pair y 0 and y. In sort of the same way we split x into y 0 and y we will process y 0 and y independently and then combined them to get x. The two functions that we combine are x 0 (n) = k x (n) = k y 0 (k)g 0 (n k) (68) y (k)g (n k), (69) and the combination of x 0 and x gives x x(n) = x 0 (n)+x (n) = k y 0 (k)g 0 (n k)+y (k)g (n k). To derive the matrix representation of this mapping we again write out the above equation for a couple of values of n to get a feel for the coefficients that result. For n = 0 we have For n = we have x(0) = +y 0 ( )g 0 (4)+y ( )g (4)+y 0 ( )g 0 ()+y ( )g () + y 0 (0)g 0 (0)+y (0)g (0) + y 0 ()g 0 ( )+y ()g ( )+y 0 ()g 0 ( 4)+y ()g ( 4)+. x() = +y 0 ( )g 0 (5)+y ( )g (5)+y 0 ( )g 0 (3)+y ( )g (3) + y 0 (0)g 0 ()+y (0)g () + y 0 ()g 0 ( )+y ()g ( )+y 0 ()g 0 ( 3)+y ()g ( 3)+. Thus as a matrix we have the mapping from y to x in terms of values of g 0 and g in great detail as. y 0 ( ) y ( ) x( ) g x( ) 0 () g () g 0 (0) g (0) g 0 ( ) g ( ) g 0 ( 4) g ( 4) g 0 ( 6) g ( 6) y 0 ( ) g 0 (3) g (3) g 0 () g () g 0 ( ) g ( ) g 0 ( 3) g ( 3) g 0 ( 5) g ( 5) x(0) = g 0 (4) g (4) g 0 () g () g 0 (0) g (0) g 0 ( ) g ( ) g 0 ( 4) g ( 4) x(+) g x(+) 0 (5) g (5) g 0 (3) g (3) g 0 () g () g 0 ( ) g ( ) g 0 ( 3) g ( 3). g 0 (6) g (6) g 0 (4) g (4) g 0 () g () g 0 (0) g (0) g 0 ( ) g ( ) As a matrix equation we can write this as x = T o y, 04 y ( ) y 0 (0) y (0) y 0 () y () y 0 () y ().

105 where T o is the mapping out of the wavelet domain. For good reconstructive properties i.e. to be able to take the inverse transform of the direct transform and get the original signal back again we must have T i T o = T o T i = I. This creates the biorthogonality conditions that h i and g i must satisfy. Problem Solutions Problem 6. (some properties of the transformation Y = U H XV) Part (a): Given X = UYV H, lets write the matrix U in block form as columns as U = [ u 0 u u N ], the matrix V H in block form as rows V H = v H 0 v H. v H N and the matrix Y as the N N matrix with components Y(i,j). Then viewing the product UYV H in block form as a N matrix (the block matrix U) times a N N matrix (the block matrix Y) times a N matrix (the block matrix V H ) we get for X the product [ ] u0 u u N, Y(0,0) Y(0,) Y(0,N ) Y(,0) Y(,) Y(,N ).... Y(N,0) Y(N,) Y(N,N ) v H 0 v H. v H N Using block matrix multiplication we have that the product of the two right most matrices is given by Y(0,0)v0 H +Y(0,)v H + +Y(0,N )vn H Y(,0)v0 H +Y(,)vH + +Y(,N )vh N.. Y(N,0)v0 H +Y(N,)v H + +Y(N,N )vn H. Or in summation notation N j=0 Y(0,j)vH j N j=0 Y(,j)vH j. N j=0 Y(N,j)vH j. 05

106 Thus then X equals [ u 0 u u N ] times this result or X = = N j=0 N i=0 Y(0,j)u 0 v H j + N j=0 N Y(i,j)u i vj H, j=0 Y(,j)u v H j + + N j=0 Y(N,j)u N v H j as we were to show. Part (b): To compute the value of A ij,x we first recall the definition of the matrix inner product, of A,B and the definition of A ij of A ij = u i v H j X = N N m=0 n=0 A (m,n)b(m,n), (70). Then using the rank-one decomposition of X of N N i =0 j =0 Y(i,j )u i v H j, (7) Equation 70 then requires us to compute the (m,n)th component of the matrix A ij and of u i vj H since X(m,n) is obtained by summing such elements via Equation 7. Consider the (m,n)th component of u i vj H. Recall u i is the ith column of U and as vj H is the jth row of V H we see that v j is the jth column of V. Then the product u i vj H looks like u i v H j = = U(0, i) U(, i). U(N,i) [ V(0,j) V(,j) V(N,j) ] U(0,i)V(0,j) U(0,i)V(,j) U(0,i)V(N,j) U(,i)V(0,j) U(,i)V(,j) U(,i)V(N,j)... U(N,i)V(0,j) U(N,i)V(,j) U(N,i)V(N,j) Thus the (m,n) element of the matrix u i v H j is and the conjugate of the (m,n)th element of the matrix u i v H j is. U(m,i)V(n,j), (7) U(m,i) V(n,j). (73) 06

107 Using these results we find A ij,x = u i v H j, = = N m=0 n=0 N N N i =0 j =0 Y(i,j )u i v H j N U(m,i) V(n,j) N i =0 j =0 N Y(i,j ) m=0 ( N ) N Y(i,j )U(m,i )V(n,j ) i =0 j =0 N U(m,i) U(m,i ) n=0 V(n,j)V(n,j ). To evaluate these sums recall that U is a unitary matrices and thus UU H = I and U H U = I. If we consider the (i,i )th element of the product U H U = I we get or N (U H )(i,n)u(n,i ) = I(i,i ), n=0 N n=0 U(n,i) U(n,i ) = I(i,i ), (74) where I(i,i ) is the Kronecker delta symbol, i.e. I(i,i ) = if i = i and is 0 otherwise. Since V is also a Hermitian matrix a similar result hold for sums of components of V. Sums like this appear twice in the above expression for A ij,x and we have the following as we were to show. A ij,x = N N i =0 j =0 Y(i,j )I(i,i )I(j,j ) = Y(i,j), Problem 6. (separable transforms) We first recall that to compute the lexicographic row ordered vector x the rows of the matrix X are ordered sequentially in a column vector. Thus if we let X(i,:) be a row vector from X then the lexicographically ordered row vector x is given by x = X(0,:) T X(,:) T. X(N,:) T Next recall that if A is a m n matrix and B is a p q matrix the Kronecker outer product of two matrices A and B denoted as A B is defined as the mn pq matrix a B a B a 3 B a n B a B a B a 3 B a n B A B =..... (75) a m B a m B a m3 B a mn B 07.

108 Thus in terms of the matrices of this problem we have U V given by the matrix U(0,0)V U(0,)V U(0,)V U(0,N )V U(,0)V U(,)V U(,)V U(,N )V..... U(N,0)V U(N,)V U(N,)V U(N,N )V Then when we multiply this by the lexicographic ordered vector x we get U(0,0)VX(0,:) T +U(0,)VX(,:) T + +U(0,N )VX(N,:) T U(,0)VX(0,:) T +U(,)VX(,:) T + +U(,N )VX(N,:) T.. U(N,0)VX(0,:) T +U(N,)VX(,:) T + +U(N,N )VX(N,:) T This is a block column matrix of size N where the blocks are N N matrices with the m block element given by N i=0 U(m,i)VX(i,:) T. Since X(i,:) T is a column vector the product VX(i,:) T is another column vector and the above is the sum of column vectors. The nth element of this column vector is given by N j=0 V(n,j)X(i,j). Thus the nth element of the mth block in the product (U V)x is N i=0 N X(i,j)U(m,i)V(n,j). j=0 If we have the desired equality this should equal the value Y(m,n). To show this we can simply recall that Y = UXV T and as such we can compute the (m,n)th element of this product. Using the summation definition of a matrix product we find Y(m,n) = = N (UX)(m,i)(V T )(i,n) = i=0 N j=0 N i=0 N j=0 U(m,j)X(j,i)V(n,i) N X(j,i)U(m,j)V(n,i), (76) i=0 which is equivalent to the expression above showing the desired equivalence. Problem 6.3 (minimizing the MSE by using the eigenvectors of R x ) Forafixedorthonormalbasise i fori = 0,,,,N,arandomvectorxthedecomposition x = N i=0 08 y(i)e i, (77)

109 with y(i) = e T i x. Note that since x is random the y(i) s are also random. The projection of x in the m-dimensional subspace spanned by the vectors e 0,e,,e m is given by ˆx = m i=0 y(i)e i. (78) Then as in the book the expectation of the error ǫ defined as ǫ = x ˆx can be shown to be given by E [ ǫ ] = N i=m e T i R xe i, (79) with R x the correlation matrix of x i.e. R x = E[xx T ]. Considering E[ ǫ ] as the objective function to be minimized we seek vectors e i that will achieve this minimum. Obviously E[ ǫ ] 0 and e i = 0 will make the right-hand-side of Equation 79 zero. To avoid this trivial solution we need to introduced the constraint that the vectors e i are normalized or e T i e i =. Question: I m not sure why we don t have to also introduce the orthogonality constraint of e T i e j = 0 for i j. I think the answer might be because of the functional form for our objective function E[ ǫ ]. For example, if the vectors for i and j appeared together as a product like e i Ae j for some matrix A in the objective function we would have to also introduce the constraint e T i e j = 0. If anyone knows more about this or has an opinion on this please contact me. Part (a): Given that we have the constraint e T i e i = we use the methods of constrained optimization to seek the optimum. That is we introduce Lagrange multipliers λ i and form the Lagrangian L = N i=m e T i R xe i N i=0 λ i (e T i e i ). Then taking the derivative with respect to e i for i = m,m +,,N and setting the result equal to zero gives R x e i λ i e i = 0 R x e i = λ i e i. Thus e i is the eigenvector and λ i is the eigenvalue of R x. Part (b): In this case using the normalization properties of e i we have that E [ ǫ ] = Thus to make this as small as possible we want to take e i to be the eigenvectors with the smallest eigenvalues. Part (c): For the given approximation above consider the magnitude of the variance of ˆx. N i=m λ i. 09

110 We find [( m )( Var(ˆx) = E[ˆx Tˆx] m )] = E y(i)e T i y(i)e i = E [ m i=0 i=0 ] m y(i)y(j)e T i e j i=0 [ m ] [ m ] [ m ] = E y(i) = E (e T i x) = E e T i xxt e i = = i=0 m e T i R xe i = i=0 m λ i. i=0 i=0 i=0 i=0 m e T i E[ xx T] e i i=0 (80) Thus since e i are chosen to be the eigenvectors of R x ordered from largest eigenvalue to smallest eigenvalue we see that this sum is maximal. Problem 6.4 (Karhunen-Loeve with the covariance matrix Σ x ) In the same way as earlier we have a representation of our random variable x given by Equation 77 and now we will take our approximation of x given by with y(i) = e T i x. m ˆx = y(i)e i + i=0 N Part (a): Consider the expected square error E[ x ˆx ]. We find N E[ x ˆx ] = E (y i c i )e i i=m [ N ] N = E (y i c i )(y j c j )e T i e j = E = i=m j=m [ N i=m ] (y i c i ) i=m N (E[yi ] E[y i]c i +c i ). i=m c i e i, (8) If we want to pick c i s that make this as small as possible, we can take the derivative with respect to c i set the result equal to zero and solve for c i we find c i E[ x ˆx ] = 0 E[y i ]+c i = 0. This gives c i = E[y i ], for i = m,m+,,n. 0

111 Part (b): We now want to ask for an approximation to x given by ˆx = m i=0 y i e i + N i=m E[y i ]e i, how do we pick the orthonormal basis vectors e i. We do that by minimizing the square norm of the error ǫ defined as ǫ = x ˆx. We find using the same techniques as the sequence of steps around Equation 80 and recalling that y i = e T i x we have [ N ] [ N ] E[ ǫ ] = E (y i E[y i ]) = E (e T i x et i E[x]) = E = i=m [ N ] (e T i (x E[x])) = E i=m i=m [ N ] e T i (x E[x])(x E[x])T e i i=m N e T i E[(x E[x])(x E[x]) T ]e i = i=m N i=m e T i Σ x e i. (8) ThustopicktheorthonormalbasisthatminimizesE[ ǫ ]weminimizeequation8subject totheconstraintthate T i e i =. IntroducingLagrangemultiplierslikeinthepreviousproblem we find e i are the eigenvectors of Σ x. Part (b): To make the expression for E[ ǫ ] as small as possible we we order these eigenvectors so that they are ranked in decreasing order of their eigenvalues, therefore the vectors e m,e m+,,e N will betheeigenvectors ofσ x corresponding tothen msmallest eigenvalues. Problem 6.5 (the eigenvalues of X H X and XX H are the same) Let λ be a nonzero eigenvalue of XX H with eigenvector v. Then by definition XX H v = λv. Now consider the vector ˆv defined by ˆv = X H v. Then X H Xˆv = X H XX H v = λx H v = λˆv. This last expression shows that ˆv is an eigenvector of X H X with eigenvalue λ. Thus both XX H and X H X have the same eigenvalues. Problem 6.6 (proving ǫ N N m=0 n=0 X(m,n) ˆX(m,n) = r i=k λ i) Recall the definition of ǫ where we have ǫ = N N m=0 n=0 X(m,n) ˆX(m,n). (83)

112 Since we have our rank k approximate matrix ˆX given by k ˆX = λi u i vi H, while the full decomposition for X is r X = λi u i vi H. We have the matrix difference X ˆX given by X ˆX r = λi u i vi H. i=0 i=0 i=k Thus the m,nth element of this matrix difference X ˆX is given by using Equation 7 r λi U(m,i)V(n,i). i=k Now recall that for a complex number x = xx when we square the above expression we have X(m,n) ˆX(m,n) r r = λi λj U(m,i)V(n,i) U(m,j) V(n,j). i=k j=k It is this expression that we will sum for m and n both running from 0 to N. When we apply this summation, then exchange the order of the sums and use Equation 74 we get ǫ = = as we were to show. r r i=k j=k N λi λj m=0 N U(m,i)U(m,j) r r r λi λj I(i,j)I(j,i) = λ i, i=k j=k i=k n=0 V(n,j)V(n,i) Problem 6.7 (an example with the SVD) The SVD decomposition of a matrix X is given by r X = λi u i vi H, i=0 where u i and v i are the eigenvectors (with common eigenvalue λ i ) of XX H and X H X respectively or XX H u i = λ i u i X H Xv i = λ i v i

113 It turns out that u i and v i are related more directly as u i = λi Xv i. In the MATLAB script chap 6 prob 7.m we compute XX H and X H X and the eigenvector and eigenvalues for these two matrices. We first find that X H X has u i eigenvectors (stored as columns) and eigenvalues given by and 0.0,.93,8.06. We next find that XX H has v i eigenvectors (again stored as columns) with eigenvalues given by [ ] and.93, To use the SVD decomposition one has to match the eigenvectors of XX H and X H X to use in the inner product so that their eigenvalues match. This means the decomposition of X is given by [ ] [ ]. Problem 6.8 (the orthogonality of the DFT) We begin by proving the following sum N exp (j πn ) { l = k +rn r = 0,±,±, N (k l)n = 0 otherwise n=0 (84) Since the sum above is a geometric sum from [6] we can evaluate it as N n=0 exp (j πn ) (k l)n = exp( j π (k l)n) N exp ( j π (k l)) N exp(jπ(k l)) = exp (. j π (k l)) N Thisistrueonlyiftheexpression wearesumming powersofisnotidentically (forwhich the denominator would be 0 and division is undefined). This latter will happen if the argument of the exponential exp ( j π N (k l)) is a multiple of π. This means the above sum is valid if k l N r, where r is an integer. If this previous condition holds then the numerator vanishes exp(jπ(k l)) = 0. 3

114 since k l is an integer. If k l is actually equal to and integer r then l = k +rn and the N sum is N and we have proven Equation 84. Now consider the (m,n) element of the product W H W. Using Equation 59 and 6 we get (W H W)(m,n) = N = N N k=0 N k=0 W mk N W kn N W k(m n) N = N N when we use Equation 84. Here δ mn is the Kronecker delta. k=0 exp ( j πn ) (m n)k = δ mn, Problem 6.9 (an example computing a -d DFT) Using Equation 63 or Y = W H XW H we can compute the two-dimensional DFT by computing the required matrix product. To generate the matrix W H of order N (as defined in the book) we can use the MATLAB command dftmtx(n)/sqrt(n). In the MATLAB script chap 6 prob 9.m given in the input matrix X we do this and perform the required matrix multiplications. We find that we get Y = j j j j Problem 6. (orthogonality of the discrete cosine transform) The discrete cosine transform (DCT) matrix C has elements C(n, k) given by C(n,k) = C(n,k) = when k = 0 N ( ) π(n+)k N cos N when k >, (85) and for 0 n N. We want to show that C T C = I. To do this consider the (i,j)th element of this product (denoted by (C T C)(i,j)) we have (C T C)(i,j) = N k=0 C T (i,j)c(k,j) = N k=0 C(k, i)c(k, j). (86) Lets evaluate this for various values of i and j. When i = 0 and j = 0 Equation 86 gives N k=0 C(k,0)C(k,0) = N k=0 4 C(k,0) = N k=0 N =.

115 When i = 0 and j Equation 86 gives (C T C)(0,j) = = N k=0 N C(k,0)C(k,j) = N C(k, j) N N k=0 By writing the summand above as ( ) π(k+)j cos N we can use the following identity [5] with α = πj N N k=0 and β = πj N ( cos(α+βk) = cos ( ) π(k +)j cos N k=0 ( πj = cos N + πj ) N k, α+ N β to evaluate it. In that case we have β = πj N, so Nβ = πj N, and α+ ) ( sin N β) sin ( ), (87) β = πj N + ( N ) πj N = πj. Thus we have N ( ) ( ) ( π(k +) πj sin πj ) cos = cos N sin ( ) πj. k=0 N Since for j =,,,N the value of πj is a multiple of π where we have cos( ) πj = 0 thus N ( ) π(k+) (C T C)(0,j) = cos = 0. N Next let i and j = 0 and we have (C T C)(i,0) = = N k=0 N k=0 C(k,i)C(k,0) = N C(k, i) N N k=0 k=0 ( ) π(k +)i cos = 0, N since this is the same sum we evaluated earlier. Finally, let i and j to get N (C T C)(i,j) = C(k,i)C(k,j) = N ( ) ( ) π(k +)i π(k +)j cos cos. N N N k=0 k=0 To evaluate this sum we could convert the trigonometric functions into exponentials and then use the sum of a geometric series identity to evaluate each sum, or we can evaluate it using Mathematica. In the Mathematica notebook chap 6 prob.nb we find it equals 4 sin(π(i j)) ) + sin(π(i+j)) ), sin sin ( π(i j) N 5 ( π(i+j) N

116 when neither of the denominators is zero, or in this case that i j. When i j both terms in the numerator vanish and this expression is zero. If i = j then we want to evaluate Using the following identity (C T C)(i,i) = N N k=0 N k=0 ( ) π(k +)i cos. N ( ) π(k +)i cos = N N + sin(πi) 4sin ( ), (88) πi as i is an integer the second term vanishes and we have shown that (C T C)(i,i) =. All of these elements show that C T C = I the desired expression. N Problem 6. (the discrete cosign transform) The discrete cosign transform (DCT) Y of an image X is given by computing Y = C T XC, where C is the matrix with elements given by Equation 85. This matrix can be constructed using that MATLAB function mk DCT matrix C.m. Then using the X matrix given in for this problem in the MATLAB script chap 6 prob.m we find Y to be Y = Problem 6.4 (orthogonality of the Hadamard transform) To begin recall the recursive definition of H n given by H n = H n H ( [ ]) = H n = [ ] Hn H n. (89) H n H n We will show that H T n = H n and H n = H n. To do that we will use recursion, where we will show that these two relationships are true for n = and then assume that they hold up to 6

117 and equal to some index n. We will then show that we can prove that the relationships hold for the index n+. Consider H we see that H T = H and H T H = [ ][ ] = [ ] 0 = I. 0 Then assume that Hn T = H n and Hn = Hn T. Consider Hn+ T = [ ] H T n Hn T = [ Hn H n H n H n H T n H T n ] = H n+, showing that H n+ is symmetric. Now consider Hn+ T H n+ = H n+ H n+ = [ ][ ] Hn H n Hn H n H n H n H n H n = [ ] [ ] H n 0 I 0 0 Hn = = I, 0 I showing that H n+ = HT n+. Problem 6.5 (computing the Hadamard transform) We can use the MATLAB command hadamard to compute the Hadamard matrix H n as defined in the book. Specifically, we have H n = n/hadamard(n ). In the MATLAB script chap 6 prob 5.m given in the input matrix X we compute Y = H ˆXH, where ˆX is a submatrix of the original matrix X. Problem 6.7 (the Noble identities) Noble Downsampling Identity: Recall that downsampling by M produces a new sequence y(k) generated by the old sequence ŷ(k) according to y(k) = ŷ(mk). (90) The transfer function for this operation, D(z), when the input signal is ŷ(k) is given by [8]. D(z) = M M k=0 Ŷ ) (z /M e πj M k, (9) where i =. Using that we can write the serial affect of downsampling followed by filtering with H(z) as M Ŷ (z /M e πj M )H(z). k M k=0 7

118 If we consider the combined system of filtering with H(z M M ), to get H(z )Ŷ(z), and then downsampling we have that the combined affect is given by M Ŷ M k=0 (z /M e πj M k )H ( ( ) ) z /M e πj M M k = M M the same expression as before, showing the equivalence. k=0 Ŷ (z /M e πj M k )H (z), Noble Upsampling Identity: Recall that upsampling by M produces the new sequence y(k) generated by the old sample function ŷ(k) according to { ŷ( k ) when k is an integer y(k) = M M (9) 0 otherwise Then the transfer function for this operation U(z) when the input signal is ŷ is given by [8] U(z) = Ŷ(zM ). (93) Then the affect of filtering with H(z) a signal with transfer function Ŷ(z) is H(z)Ŷ(z). Following this by upsampling by M gives the transfer function H(z M )Ŷ(zM ). This is the same as taking the input Ŷ(z) upsampling by M to get Ŷ(zM ) and then passing that output through the linear system H(z M ), showing the equivalence of the two Noble upsampling forms. Problem 6.8 (an equivalent filter bank representation) Infigure 6.5we have three pathsto generatethe outputsy 0, y, andy. When drawnwithout the traditional system box notation we get these three paths given by,x(),x(),x(0) H 0 (z) y 0,x(),x(),x(0) H (z) H 0 (z) y,x(),x(),x(0) H (z) H (z) y. Thesystemfory 0 matchesthesamesysteminfigure6.6bfory 0 whenwetake ˆF 0 (z) = H 0 (z). If we then use Noble s downsampling identity we can write the output expressions for y and y as,x(),x(),x(0) H (z) H 0 (z ) y,x(),x(),x(0) H (z) H (z ) y. We can simplify the two downsampling procedures of size on the right side of these expressions into one downsampling expression of size 4, and combine the two serial systems to get,x(),x(),x(0) H (z)h 0 (z ) 4 y,x(),x(),x(0) H (z)h (z ) 4 y. This matches the system drawn in Figure 6.6 when we take ˆF (z) = H (z)h 0 (z ) and ˆF (z) = H (z)h (z ) proving the equivalence. 8

119 Feature Generation II Notes on the text Notes on co-occurrence matrices The co-occurrence matrices provide a way to measure the relative position of gray levels in an image and as such are functions of a pixel displacement d and an angle offset φ. They rely on the joint probabilities that the given image takes gray level values at the angular direction φ and a distance d. Typically d = while φ is taken to be from {0,45,90,35} in degrees. That is for φ = 0 we need to compute P(I(m,n) = I,I(m±d,n) = I ) = number of pairs of pixels with levels = (I,I ) total number of possible pixel pairs Probability density functions can be obtained for φ = 45 where the probability we are extracting can be written P(I(m,n) = I,I(m±d,n d) = I ). To make things easier to understand assume that we have four gray levels then I(m,n) {0,,,3} and we can define the co-occurrence matrix A (with elements P(I,I ) above) when given a specification of (d,φ) as A(d,φ) = R η(0,0) η(0,) η(0,) η(0,3) η(,0) η(,) η(,) η(,3) η(,0) η(,) η(,) η(,3) η(3,0) η(3,) η(3,) η(3,3), (94) The book writes these A matrices with φ as a superscript as A φ (d). Here η(i,i ) is the number of pixel pairs at a relative position of (d,φ) which have a gray level pair (I,I ) respectively and R is the total number of pixel pairs in that orientation in the given image. Lets consider in some detail how to compute the expression R, the number of pairs of pixels we have to consider, in the co-occurrence matrix above for some common orientations. Lets first consider the case where d = and φ = 0. Then anchored at each pixel of the image we will try to look left and right (since φ = 0) by one (since d = ) and consider the resulting image values (I,I ). For simplicity assume our input image has four rows/columns. Now at the position (0,0) (upper left corner) we can only look right (otherwise we are looking off the image) which gives one pixel pair. If we are at the pixel (0,) we can look left and right for two pairs of pixels. At the pixel located at (0,) we can look left and right giving two pixel pairs, and finally at the pixel (0,3) we can look only left giving one pixel pair. In total this is +++ = 6, left/right pairs per row. Since we have four rows we have R = 6 4 = 4 possible pairs. In general for an image of dimension M N with M rows and N columns we have +(N )+ = (N )+ = N, 9.

120 pixel pairs in each row, so R(d =,φ = 0) = M(N ). If φ = 90 (again with d = ) we have +(M )+ = M pixel pairs in each column so with N columns we have R(d =,φ = 90) = N(M ), pixel pairs to consider. If φ = 45 the first row has no pixel pairs in the northeast direction or of the form (I(m,n),I(m d,n+d)) since we would be looking off the image, but has N pairs in the southwest direction of the form (I(m,n),I(m+d,n d)). All rows but the last have (N )+ = N pixel pairs to consider. The last row has N pixel pairs of the form (I(m,n),I(m d,n+d)) and none like (I(m,n),I(m+d,n d)). Thus in total we have R(d =,φ = 45) = (N )+(M )(N ) = (N )(M ). The pair count for R(d =,φ = 35) should equal that of R(d =,φ = 45) but with N and M exchanged, R(d =,φ = 35) = (M )(N ). In practice, a very simple algorithm for computing the co-occurrence matrix for given values of d and φ is to start with η(i,i ) = 0 and then walk the image over all pixels looking at the image values (I,I ) of the two pixels specified by the (d,φ) inputs and incrementing the corresponding η(i,i ). The value of A can then be obtained after the fact by summing all the elements of the η array. In the MATLAB function co occurrence.m using this very simple algorithm we compute the co-occurrence matrix for an input (d, φ). For the sample input image for (d,φ) = (,0) this routine gives I = A(,0) = , thesameasinthebook. IntheMATLABscriptdup co occurrence example.mweduplicate all of the co-occurrence matrices from this section., Notes on second-order statistics features We implemented several of the summary second-order statistics discussed in this section as MATLAB functions. In particular we implemented 0

Notes on second-order statistics features

We implemented several of the summary second-order statistics discussed in this section as MATLAB functions. In particular we implemented

the angular second moment in ASM.m,
the contrast measure in CON.m,
the inverse difference moment measure in IDF.m,
the entropy measure in entropy.m.

In general, these routines can take additional input arguments d and φ that specify the displacement and angular offset used in computing the co-occurrence matrix that enters the feature definition. If these additional arguments are not given, then the functions above compute all four possible co-occurrence matrices, the corresponding feature from each, and then return the average of the four feature measurements.

Notes on gray level run lengths

For these run-length features we pick a direction φ as before; then the matrix Q_RL(φ) has its (i,j)th element given by the number of times that the gray level i, for i = 0,1,…,N_g − 1, appears with a run length j = 1,2,…,N_r in the direction φ. Thus Q_RL is a matrix of dimension N_g × N_r. We implemented the computation of the gray level run-length matrix Q_RL in the MATLAB function Q_RL.m. Using that, in the MATLAB script dup_run_length_example.m we duplicated the book's examples computing Q_RL(0) and Q_RL(45). We then implemented a number of derivative features based on Q_RL(φ), namely

the short run emphasis in SRE.m,
the long run emphasis measure in LRE.m,
the gray level nonuniformity measure in GLNU.m,
the run length nonuniformity measure in RLN.m,
the run percentage measure in RP.m.

The above computations are functions of the angle φ. As with the second-order statistics above, to make these features rotation invariant we compute them for each of the four possible values φ ∈ {0, 45, 90, 135} and then return the average of these four feature measurements.
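The run-length count along φ = 0 can also be sketched in a few lines of MATLAB; the function name and the requirement that N_r be at least as large as the longest run are my own assumptions, not the book's.

    function Q = run_length_sketch(I, Ng, Nr)
      % I: image with integer gray levels in 0..Ng-1.
      % Q(i+1,j): number of runs of gray level i having length j along phi = 0.
      Q = zeros(Ng, Nr);                 % assumes Nr >= the longest run in I
      [M, N] = size(I);
      for m = 1:M
        n = 1;
        while n <= N
          level = I(m, n);
          len = 1;                       % extend the run while the level repeats
          while n + len <= N && I(m, n + len) == level
            len = len + 1;
          end
          Q(level + 1, len) = Q(level + 1, len) + 1;
          n = n + len;                   % jump past the run just counted
        end
      end
    end

Other angles work the same way after re-indexing the image: φ = 90 corresponds to scanning the columns (running the same loop on I'), and φ = 45 or 135 to scanning the diagonals.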

Notes on local linear transformations

Given the three basis vectors b_1 = [1 2 1]^T for a local average, b_2 = [1 0 −1]^T for local edge detection, and b_3 = [1 −2 1]^T for local spot detection (these are the standard three-tap Laws masks, modulo the book's normalization constants), the nine local 3 × 3 filters are given by all possible outer products:

b_1 b_1^T, b_1 b_2^T, b_1 b_3^T, b_2 b_1^T, b_2 b_2^T, b_2 b_3^T, b_3 b_1^T, b_3 b_2^T, b_3 b_3^T.

For example,

b_1 b_2^T = [1; 2; 1] [1 0 −1] = [ 1 0 −1
                                   2 0 −2
                                   1 0 −1 ].

All outer products are computed in the MATLAB script gen_local_linear_transformations.m, which duplicates the nine matrices presented in the text.

Notes on geometric moments

Figure 7: Duplicate petasti figures used to extract Hu invariant moments from.

In this section I attempted to duplicate the results in the book on using the Hu moments for image feature extraction. I first got a set of images of the petasti symbol, see Figure 7. Then in the MATLAB code dup_Hu_moments.m we load in gif versions of these images and call the function Hu_moments.m. I scaled each image to the x and y ranges [−1,+1] before computing the moments. The dynamic ranges of the higher central moments are significantly smaller than those of the earlier ones (and some seem to vanish to zero); thus, to compare the extracted moments across images, we extract and plot the logarithm of the absolute value of the direct moments, as is suggested in [4]. This gives a table of log|φ_i|, for i = 1,…,7, for each of the six petasti images (a)–(f); the numerical values are not reproduced here.
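Since the nine masks are just outer products, they can be generated in a few lines of MATLAB (a script-style sketch using the basis vectors as given above):

    % Generate the 9 local 3x3 filters as all outer products b_i * b_j'.
    b = { [1; 2; 1], ...     % local average
          [1; 0; -1], ...    % edge detection
          [1; -2; 1] };      % spot detection
    masks = cell(3, 3);
    for i = 1:3
      for j = 1:3
        masks{i, j} = b{i} * b{j}';   % each outer product is a 3x3 mask
      end
    end
    celldisp(masks)                   % display all nine masks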

In general these numbers look very similar for each of the images. I was, however, unable to duplicate the exact numerical results for the values of φ_i from the book. If anyone sees anything wrong with what I have done, or a way that I can better match the book's results, please let me know.

Notes on Zernike moments

The indices for the Zernike moments require that p = 0, 1, 2, … and |q| ≤ p with p − |q| even. This means that we have the following valid combinations (for a few values of p only):

p = 0 so q = 0 only.
p = 1 so q = ±1 only.
p = 2 so q ∈ {−2, 0, +2}.
p = 3 so q ∈ {−3, −1, +1, +3}.
p = 4 so q ∈ {−4, −2, 0, +2, +4}.
p = 5 so q ∈ {−5, −3, −1, +1, +3, +5}.

The pattern at this point seems clear.

Notes on Fourier features

Consider the complex boundary u_k with the origin shifted by the index k_0, i.e. the signal u_{k−k_0}. This shifted boundary has Fourier features given by

f_l' = Σ_{k=0}^{N−1} u_{k−k_0} e^{−j(2π/N)lk}
     = Σ_{k=−k_0}^{N−1−k_0} u_k e^{−j(2π/N)l(k+k_0)}
     = e^{−j(2π/N)lk_0} Σ_{k=−k_0}^{N−1−k_0} u_k e^{−j(2π/N)lk}.

Since both u_k and e^{−j(2π/N)lk} are periodic in the index k with period N, this last sum equals f_l. Thus we have shown

f_l' = e^{−j(2π/N)lk_0} f_l.    (95)

Note that a shift of origin does not change the magnitude of the Fourier coefficients.
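Equation 95 is easy to verify numerically. The following MATLAB sketch (all variable names are my own) builds a random complex boundary, shifts its sampling origin with circshift, and checks that the fft coefficients pick up exactly the predicted phase factor:

    N = 64; k0 = 5;
    u  = randn(N,1) + 1j*randn(N,1);   % a random complex "boundary"
    f  = fft(u);                       % f(l+1) = sum_k u(k+1) e^{-j 2 pi l k / N}
    fs = fft(circshift(u, k0));        % Fourier features of u_{k-k0}
    l  = (0:N-1).';
    max(abs(fs - exp(-1j*2*pi*l*k0/N) .* f))   % ~1e-13, i.e. roundoff
    max(abs(abs(fs) - abs(f)))                 % magnitudes are shift invariant

Note that MATLAB's fft uses exactly the sign convention of the definition above.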

The book then presents an argument as to why the normalized Fourier coefficients are invariant to the location of the index origin. We present a somewhat different version of that argument here. Since the Fourier coefficients change depending on the k origin (see Equation 95 above), we would like to define Fourier features that are independent of this choice of origin. One way to do that is the following. One simply computes the Fourier features directly using the definition

f_l = Σ_{k=0}^{N−1} u_k e^{−j(2π/N)lk},    (96)

ignoring any issue of the k origin. Then we explicitly write the first Fourier coefficient f_1 in polar form as f_1 = |f_1| e^{−jφ_1} (note the negative sign in the angle). With this definition of φ_1 as the polar angle of the first Fourier feature, we define the normalized Fourier coefficients ˆf_l as

ˆf_l = f_l exp(jlφ_1),    (97)

that is, we multiply each of the previously computed Fourier coefficients by a power of the complex phase of f_1. We claim that these normalized Fourier coefficients are invariant to the choice of the sampling origin in k. To show this, imagine that we had selected a different origin (say k_0) at which to begin sampling u_k. Then from Equation 95 the first Fourier feature would be transformed to f_1', given by the product of the old value f_1 and a phase shift:

f_1' = f_1 e^{−j2πk_0/N} = |f_1| e^{−jφ_1} e^{−j2πk_0/N} = |f_1| e^{−j(φ_1 + 2πk_0/N)}.

Thus the angular phase we would extract as the polar angle of f_1' is φ_1' = φ_1 + 2πk_0/N. The normalized Fourier coefficients we compute for this shifted path u_{k−k_0} again use Equation 97: ˆf_l' = f_l' exp(jlφ_1'). Using Equation 95 to evaluate f_l' and replacing φ_1' with what we found above, we get

ˆf_l' = f_l e^{−j(2π/N)lk_0} exp( jl(φ_1 + 2πk_0/N) ) = f_l exp(jlφ_1) = ˆf_l,

showing that the normalized Fourier coefficients are indeed invariant with respect to where we start sampling u_k in k.
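This normalization can be checked numerically in the same spirit as the sketch above (again, variable names are my own):

    N = 64; k0 = 5; l = (0:N-1).';
    u  = randn(N,1) + 1j*randn(N,1);    % a random complex boundary
    f  = fft(u);                        % coefficients for the original origin
    fs = fft(circshift(u, k0));         % coefficients for the shifted origin
    phi1  = -angle(f(2));               % f(2) holds f_1 = |f_1| e^{-j phi_1}
    fhat  = f  .* exp(1j*l*phi1);       % normalized coefficients, Eq. 97
    phi1s = -angle(fs(2));              % polar angle seen after the shift
    fhats = fs .* exp(1j*l*phi1s);
    max(abs(fhats - fhat))              % ~1e-13: invariant to the k origin

The 2π ambiguity in the extracted angle is harmless here because exp(jlφ_1) only ever sees integer multiples l of the angle.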

Problem Solutions

Problem 7.1 (the ASM and the CON features)

This problem is worked in the MATLAB script prob_7_1.m.

Problem 7.2 (the run-length matrices)

This problem is worked in the MATLAB script prob_7_2.m.

Problem 7.3 (image contrast)

Consider an image that alternates between the gray levels 0 and 1 when we look along the φ = 0 direction but takes a constant value when we look along the φ = 45 direction, for example the 4 × 4 checkerboard

I = [ 0 1 0 1
      1 0 1 0
      0 1 0 1
      1 0 1 0 ].

Our routines give CON(I,1,0) = 1 while CON(I,1,45) = 0. This problem is worked in the MATLAB script prob_7_3.m.

Problem 7.4 (feature extraction on various test images)

For this problem we use the MATLAB command imread to load in two test images from the book. The imread command creates a matrix of quantized gray levels that can be processed by the routines developed above. The two test images selected (and converted to postscript for display in this document) are shown in Figure 8.

Figure 8: Left: A plot of the book's image.png. Right: A plot of the book's image3.png.

This problem is worked in the MATLAB script prob_7_4.m. When that script is run we get the outputs given in the accompanying table.
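As a sanity check of the Problem 7.3 claim, the contrast feature CON = Σ_{i,j} (i − j)^2 P(i,j) can be evaluated on the checkerboard with a few self-contained MATLAB lines (function and variable names are mine, not the book's routines; the implicit expansion in con requires MATLAB R2016b or later):

    I = repmat([0 1; 1 0], 2, 2);       % the 4x4 checkerboard above
    con = @(A) sum(sum(((0:size(A,1)-1).' - (0:size(A,2)-1)).^2 .* A));
    [con(cooc(I, 0, 1)), con(cooc(I, -1, 1))]   % gives [1 0] for phi = 0, 45

    function A = cooc(I, dm, dn)        % same counting scheme as earlier
      G = max(I(:)) + 1;
      eta = zeros(G, G);
      [M, N] = size(I);
      for m = 1:M
        for n = 1:N
          mp = m + dm; np = n + dn;
          if mp >= 1 && mp <= M && np >= 1 && np <= N
            eta(I(m,n)+1, I(mp,np)+1) = eta(I(m,n)+1, I(mp,np)+1) + 1;
          end
        end
      end
      eta = eta + eta';
      A = eta / sum(eta(:));
    end

Every horizontal neighbor pair in the checkerboard has levels (0,1) or (1,0), so the φ = 0 contrast is 1, while every 45-degree neighbor pair has equal levels, giving a contrast of 0.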
