Application: Can we tell what people are looking at from their brain activity (in real time)? Gaussian Spatial Smooth


1 Application: Can we tell what people are looking at from their brain activity (in real time)? Gaussian spatial smoothing.

2 The Data. Block paradigm (six runs per subject). Three categories of objects (counterbalanced across runs): chairs, faces, and houses, plus a phase-scrambled control stimulus. Two tasks: delayed matching and passive viewing.

3 Patterns of Activation. Activity space: voxels are considered as axes in a high-dimensional space, so every brain response can be represented as a point in that multidimensional activity space.
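To make this representation concrete, here is a minimal sketch with synthetic data (the voxel count, trial counts, and condition labels are all made up): each brain response becomes a row vector in R^d, and a run becomes a labeled point cloud in activity space.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 500        # number of voxels, i.e. axes of the activity space (made up)
n_trials = 30  # brain responses recorded per category (made up)

# Each row is one brain response: a single point in R^d.
faces  = rng.normal(loc=0.3,  scale=1.0, size=(n_trials, d))
houses = rng.normal(loc=-0.3, scale=1.0, size=(n_trials, d))

X = np.vstack([faces, houses])   # (60, 500): 60 points in activity space
y = np.array(["face"] * n_trials + ["house"] * n_trials)  # condition labels
```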

4 (Figures: an activity space in which the voxels are not informative vs. an activity space in which the voxels are informative.)

5 Discriminant Functions for the Normal Density. We saw that minimum-error-rate classification can be achieved by the discriminant function $g_i(x) = \ln p(x \mid \omega_i) + \ln P(\omega_i)$. In the case of the multivariate normal, $g_i(x) = -\frac{1}{2}(x - \mu_i)^t \Sigma_i^{-1} (x - \mu_i) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)$.

6 Special case $\Sigma_i = \Sigma$. Here $g_i(x) = -\frac{1}{2}(x - \mu_i)^t \Sigma^{-1} (x - \mu_i) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma| + \ln P(\omega_i)$, and we decide $\omega_i$ over $\omega_j$ when $g_i(x) - g_j(x) > 0$. Now $(x - \mu_i)^t \Sigma^{-1} (x - \mu_i) = x^t \Sigma^{-1} x - 2\mu_i^t \Sigma^{-1} x + \mu_i^t \Sigma^{-1} \mu_i$, and the quadratic term $x^t \Sigma^{-1} x$ is the same for every class, so it cancels in the difference: $g_i(x) - g_j(x) = \mu_i^t \Sigma^{-1} x - \mu_j^t \Sigma^{-1} x - \frac{1}{2}\mu_i^t \Sigma^{-1} \mu_i + \frac{1}{2}\mu_j^t \Sigma^{-1} \mu_j + \ln\frac{P(\omega_i)}{P(\omega_j)} = w_i^t x - w_j^t x + w_{i0} - w_{j0}$.

7 Case $\Sigma_i = \sigma^2 I$ ($I$ stands for the identity matrix): $g_i(x) = w_i^t x + w_{i0}$ (a linear discriminant function), where $w_i = \frac{\mu_i}{\sigma^2}$ and $w_{i0} = -\frac{1}{2\sigma^2}\mu_i^t \mu_i + \ln P(\omega_i)$ ($w_{i0}$ is called the threshold for the $i$th category!).
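A minimal sketch of this case, assuming the class means, the shared $\sigma^2$, and the priors have already been estimated; all values below are illustrative.

```python
import numpy as np

def linear_discriminants(x, means, sigma2, priors):
    """Evaluate g_i(x) = w_i^t x + w_i0 for every class (Sigma_i = sigma^2 I)."""
    scores = []
    for mu, prior in zip(means, priors):
        w = mu / sigma2                                 # w_i = mu_i / sigma^2
        w0 = -(mu @ mu) / (2 * sigma2) + np.log(prior)  # threshold w_i0
        scores.append(w @ x + w0)
    return np.array(scores)

# Illustrative values: two classes in 2-D with equal priors.
means = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
g = linear_discriminants(np.array([1.5, 1.0]), means, sigma2=1.0, priors=[0.5, 0.5])
print(g.argmax())   # index of the category with the largest discriminant
```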

8 A classifier that uses linear discriminant functions is called a linear machine. The decision surfaces for a linear machine are pieces of hyperplanes defined by $g_i(x) = g_j(x)$: $g_i(x) - g_j(x) = w_i^t x + w_{i0} - w_j^t x - w_{j0} = (w_i - w_j)^t x + (w_{i0} - w_{j0}) = (w_i - w_j)^t (x - x_0)$. For the identity-covariance case this is $\frac{1}{\sigma^2}(\mu_i - \mu_j)^t (x - x_0)$.

9 (Figure only.)

10 The hyperplane separating $R_i$ and $R_j$ passes through the point $x_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{\sigma^2}{\|\mu_i - \mu_j\|^2} \ln\frac{P(\omega_i)}{P(\omega_j)}\,(\mu_i - \mu_j)$ and is always orthogonal to the line linking the means! If $P(\omega_i) = P(\omega_j)$, then $x_0 = \frac{1}{2}(\mu_i + \mu_j)$.
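A quick numeric check of this formula, with made-up means, $\sigma^2 = 1$, and an unequal prior: raising $P(\omega_i)$ pushes $x_0$ away from the midpoint toward the mean of the less probable class.

```python
import numpy as np

mu_i, mu_j = np.array([0.0, 0.0]), np.array([2.0, 0.0])   # made-up means
sigma2, p_i, p_j = 1.0, 0.8, 0.2                          # made-up prior imbalance

diff = mu_i - mu_j
x0 = 0.5 * (mu_i + mu_j) - sigma2 / (diff @ diff) * np.log(p_i / p_j) * diff
print(x0)   # ~[1.69, 0]: shifted from the midpoint [1, 0] toward the rarer class's mean
```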

11 (Figure only.)

12 (Figure only.)

13 Case $\Sigma_i = \Sigma$ (the covariances of all classes are identical but arbitrary!). The hyperplane separating $R_i$ and $R_j$ passes through $x_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{\ln\left[P(\omega_i)/P(\omega_j)\right]}{(\mu_i - \mu_j)^t \Sigma^{-1} (\mu_i - \mu_j)}\,(\mu_i - \mu_j)$. (This hyperplane is generally not orthogonal to the line between the means!)

14 (Figure only.)

15 (Figure only.)

16 Case $\Sigma_i$ arbitrary: the covariance matrices are different for each category. Then $g_i(x) = x^t W_i x + w_i^t x + w_{i0}$, where $W_i = -\frac{1}{2}\Sigma_i^{-1}$, $w_i = \Sigma_i^{-1}\mu_i$, and $w_{i0} = -\frac{1}{2}\mu_i^t \Sigma_i^{-1} \mu_i - \frac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)$. The resulting decision surfaces are hyperquadrics: hyperplanes, pairs of hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, or hyperhyperboloids.
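A sketch of this general quadratic discriminant, assuming per-class means, covariances, and priors are supplied from elsewhere; slogdet is used for $\ln|\Sigma_i|$ purely for numerical stability.

```python
import numpy as np

def quadratic_discriminant(x, mu, Sigma, prior):
    """g_i(x) = x^t W_i x + w_i^t x + w_i0 for an arbitrary per-class Sigma_i."""
    Sinv = np.linalg.inv(Sigma)
    W = -0.5 * Sinv                          # W_i = -1/2 Sigma_i^{-1}
    w = Sinv @ mu                            # w_i = Sigma_i^{-1} mu_i
    _, logdet = np.linalg.slogdet(Sigma)     # stable ln|Sigma_i|
    w0 = -0.5 * (mu @ Sinv @ mu) - 0.5 * logdet + np.log(prior)
    return x @ W @ x + w @ x + w0

# Illustrative call with made-up parameters.
print(quadratic_discriminant(np.array([1.0, 0.5]),
                             mu=np.zeros(2), Sigma=np.eye(2), prior=0.5))
```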

17 (Figure only.)

18 (Figure only.)

19 Estimating model parameters. Plug-in estimates (use a procedure to get the mean and covariance, then plug these values in): maximum likelihood, maximum a posteriori; these overfit the training data. Bayesian estimates: take into account the reliability of your mean and covariance estimates to get better generalization.

20 Bayesian Estimation (Bayesian learning applied to pattern classification problems). MLE: $\theta$ is presumed fixed. BE: $\theta$ is a random variable (we are ignorant of its value). The computation of the posterior probabilities $P(\omega_i \mid x)$ lies at the heart of Bayesian classification. Goal: compute $P(\omega_i \mid x, D)$. Given the sample $D$, Bayes' formula can be written $P(\omega_i \mid x, D) = \frac{P(x \mid \omega_i, D)\, P(\omega_i \mid D)}{\sum_{j=1}^{c} P(x \mid \omega_j, D)\, P(\omega_j \mid D)}$.

21 Maximum-Likelihood Estimation. It has good convergence properties as the sample size increases and is simpler than the alternative techniques. General principle: assume we have $c$ classes and $P(x \mid \omega_j) \sim N(\mu_j, \Sigma_j)$, i.e. $P(x \mid \omega_j) \equiv P(x \mid \omega_j, \theta_j)$, where $\theta_j = (\mu_j, \Sigma_j) = (\mu_j^1, \mu_j^2, \ldots, \sigma_j^{11}, \sigma_j^{22}, \operatorname{cov}(x_j^m, x_j^n), \ldots)$.

22 ML Problem Statement. Let $D = \{x_1, x_2, \ldots, x_n\}$, so that $P(x_1, \ldots, x_n \mid \theta) = \prod_{k=1}^{n} P(x_k \mid \theta)$ and $|D| = n$. Our goal is to determine $\hat\theta$, the value of $\theta$ that makes this sample the most representative!

23 Use the training samples to estimate $\theta = (\theta_1, \theta_2, \ldots, \theta_c)$, where each $\theta_i$ ($i = 1, 2, \ldots, c$) is associated with one category. Suppose that $D$ contains $n$ samples $x_1, x_2, \ldots, x_n$. Then $P(D \mid \theta) = \prod_{k=1}^{n} P(x_k \mid \theta) = F(\theta)$; $P(D \mid \theta)$ is called the likelihood of $\theta$ w.r.t. the set of samples. The ML estimate $\hat\theta$ of $\theta$ is, by definition, the value that maximizes $P(D \mid \theta)$: it is the value of $\theta$ that best agrees with the actually observed training sample.
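A sketch evaluating this likelihood for a univariate Gaussian on synthetic samples; the product over samples is computed in log space (anticipating the log-likelihood introduced two slides below) to avoid numerical underflow.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
D = rng.normal(loc=3.0, scale=2.0, size=100)   # synthetic training samples

def log_likelihood(theta, D):
    """ln P(D | theta) = sum_k ln P(x_k | theta) for a 1-D Gaussian."""
    mu, sigma = theta
    return norm.logpdf(D, loc=mu, scale=sigma).sum()

# The true parameters score higher than a badly misplaced mean.
print(log_likelihood((3.0, 2.0), D) > log_likelihood((0.0, 2.0), D))   # True
```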

24 (Figure only.)

25 Optimal estimation. Let $\theta = (\theta_1, \theta_2, \ldots, \theta_p)^t$ and let $\nabla_\theta$ be the gradient operator $\nabla_\theta = \left[\frac{\partial}{\partial\theta_1}, \frac{\partial}{\partial\theta_2}, \ldots, \frac{\partial}{\partial\theta_p}\right]^t$. We define $l(\theta)$ as the log-likelihood function $l(\theta) = \ln P(D \mid \theta)$ and find the $\theta$ that maximizes the log-likelihood: $\hat\theta = \arg\max_\theta l(\theta)$.

26 The set of necessary conditions for an optimum is $\nabla_\theta l = \sum_{k=1}^{n} \nabla_\theta \ln P(x_k \mid \theta) = 0$.
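When these conditions have no convenient closed form, $l(\theta)$ can be maximized numerically instead. A sketch using a generic optimizer on the negative log-likelihood of a univariate Gaussian; the $\ln\sigma$ parametrization is just a device to keep $\sigma$ positive.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(2)
D = rng.normal(loc=3.0, scale=2.0, size=200)   # synthetic training samples

# Negative log-likelihood; theta = (mu, ln sigma) so sigma stays positive.
def neg_l(theta):
    mu, log_sigma = theta
    return -norm.logpdf(D, loc=mu, scale=np.exp(log_sigma)).sum()

res = minimize(neg_l, x0=np.array([0.0, 0.0]))
print(res.x[0], np.exp(res.x[1]))   # close to the sample mean and std
```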

27 Example of a specific case: unknown $\mu$, with $P(x_k \mid \mu) \sim N(\mu, \Sigma)$ (samples from a multivariate normal distribution). Here $\ln P(x_k \mid \mu) = -\frac{1}{2}\ln\left[(2\pi)^d |\Sigma|\right] - \frac{1}{2}(x_k - \mu)^t \Sigma^{-1} (x_k - \mu)$ and, taking $\theta = \mu$, $\nabla_\mu \ln P(x_k \mid \mu) = \Sigma^{-1}(x_k - \mu)$. Therefore the ML estimate for $\mu$ must satisfy $\sum_{k=1}^{n} \Sigma^{-1}(x_k - \hat\mu) = 0$, i.e. $\Sigma^{-1}\sum_{k=1}^{n}(x_k - \hat\mu) = 0$, i.e. $\sum_{k=1}^{n} x_k - n\hat\mu = 0 \Rightarrow \hat\mu = \frac{1}{n}\sum_{k=1}^{n} x_k$.

28 Multiplying by $\Sigma$ and rearranging, we obtain $\hat\mu = \frac{1}{n}\sum_{k=1}^{n} x_k$: just the arithmetic average of the training samples! Conclusion: if $P(x_k \mid \omega_j)$ ($j = 1, 2, \ldots, c$) is Gaussian in a $d$-dimensional feature space, then we can estimate the vector $\theta = (\theta_1, \theta_2, \ldots, \theta_c)^t$ and perform classification!

29 ML Estimation, Gaussian case: unknown $\mu$ and $\sigma^2$ (univariate), with $\theta = (\theta_1, \theta_2) = (\mu, \sigma^2)$. For one sample, $l = \ln P(x_k \mid \theta) = -\frac{1}{2}\ln 2\pi\theta_2 - \frac{1}{2\theta_2}(x_k - \theta_1)^2$, and the condition $\nabla_\theta l = \begin{pmatrix} \partial l / \partial\theta_1 \\ \partial l / \partial\theta_2 \end{pmatrix} = 0$ gives $\begin{pmatrix} \frac{1}{\theta_2}(x_k - \theta_1) \\ -\frac{1}{2\theta_2} + \frac{(x_k - \theta_1)^2}{2\theta_2^2} \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$.

30 Summation over the $n$ samples: $\sum_{k=1}^{n} \frac{1}{\hat\theta_2}(x_k - \hat\theta_1) = 0$ (1) and $-\sum_{k=1}^{n} \frac{1}{\hat\theta_2} + \sum_{k=1}^{n} \frac{(x_k - \hat\theta_1)^2}{\hat\theta_2^2} = 0$ (2). Combining (1) and (2), one obtains $\hat\mu = \frac{1}{n}\sum_{k=1}^{n} x_k$ and $\hat\sigma^2 = \frac{1}{n}\sum_{k=1}^{n}(x_k - \hat\mu)^2$.
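Both closed-form estimates are one line each on data; a sketch on a synthetic sample:

```python
import numpy as np

rng = np.random.default_rng(3)
D = rng.normal(loc=3.0, scale=2.0, size=200)   # synthetic 1-D training set

mu_hat = D.mean()                        # (1/n) * sum_k x_k
sigma2_hat = ((D - mu_hat) ** 2).mean()  # (1/n) * sum_k (x_k - mu_hat)^2
print(mu_hat, sigma2_hat)                # close to 3.0 and 4.0
```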

31 Bias. The ML estimate for $\sigma^2$ is biased: $E\!\left[\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\right] = \frac{n-1}{n}\,\sigma^2 \neq \sigma^2$. An elementary unbiased estimator for $\Sigma$ is the sample covariance matrix $C = \frac{1}{n-1}\sum_{k=1}^{n}(x_k - \hat\mu)(x_k - \hat\mu)^t$.
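A small simulation (the sample size and variance are made up) that exhibits the $(n-1)/n$ factor empirically:

```python
import numpy as np

rng = np.random.default_rng(4)
n, trials, true_var = 5, 100_000, 4.0   # made-up simulation setup

samples = rng.normal(0.0, np.sqrt(true_var), size=(trials, n))
ml_var = samples.var(axis=1, ddof=0)      # 1/n normalization: the ML estimate
unbiased = samples.var(axis=1, ddof=1)    # 1/(n-1) normalization

print(ml_var.mean())    # about (n-1)/n * 4.0 = 3.2: biased low
print(unbiased.mean())  # about 4.0
```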

32 Problems with using the ML estimate as a plug-in estimate. (Diagram: repeated sampling from the population distribution $N(\mu_j, \Sigma_j) = P(x_j \mid \omega_k)$ produces different datasets $D_1, D_2, \ldots, D_c$, and hence different estimates $\hat\theta_1, \hat\theta_2, \ldots, \hat\theta_c$.)

33 Bayesian Estimation (Bayesian learning applied to pattern classification problems). In MLE $\theta$ is presumed fixed; in BE $\theta$ is a random variable. The computation of the posterior probabilities $P(\omega_i \mid x)$ lies at the heart of Bayesian classification. Goal: compute $P(\omega_i \mid x, D)$. Given the sample $D$, Bayes' formula can be written $P(\omega_i \mid x, D) = \frac{P(x \mid \omega_i, D)\, P(\omega_i \mid D)}{\sum_{j=1}^{c} P(x \mid \omega_j, D)\, P(\omega_j \mid D)}$.

34 Problem: find $P(x \mid \omega_i) = P(x \mid \theta_i)$, but $\theta_i$ is unknown. Solution: learn $P(\theta_i \mid D)$ from the data; then $P(x \mid \omega_i, D) = \int P(x \mid \theta_i)\, P(\theta_i \mid D)\, d\theta_i$, and $P(\omega_i) = P(\omega_i \mid D)$ (the training sample provides this!). Thus $P(\omega_i \mid x, D) = \frac{P(x \mid \omega_i, D_i)\, P(\omega_i)}{\sum_{j=1}^{c} P(x \mid \omega_j, D_j)\, P(\omega_j)}$.
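The integral over $\theta_i$ can be approximated by Monte Carlo: draw parameter values from the learned posterior and average the resulting likelihoods. A sketch for a univariate Gaussian whose mean is uncertain (the posterior parameters below are hypothetical):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
mu_n, sigma_n = 2.8, 0.3   # hypothetical posterior P(theta_i | D) over the mean
sigma = 1.0                # known noise std in P(x | theta_i)

thetas = rng.normal(mu_n, sigma_n, size=10_000)   # draws from P(theta_i | D)
x = 2.0
p_x_given_D = norm.pdf(x, loc=thetas, scale=sigma).mean()  # approximates the integral
print(p_x_given_D)
```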

35 Bayesian Parameter Estimation: Gaussian Case. Goal: estimate $\theta$ using the a-posteriori density $P(\theta \mid D)$. The univariate case, $P(\mu \mid D)$: $\mu$ is the only unknown parameter, with $P(x \mid \mu) \sim N(\mu, \sigma^2)$ and prior $P(\mu) \sim N(\mu_0, \sigma_0^2)$ ($\mu_0$ and $\sigma_0$ are known!).

36 $P(\mu \mid D) = \frac{P(D \mid \mu)\, P(\mu)}{\int P(D \mid \mu)\, P(\mu)\, d\mu} = \alpha \prod_{k=1}^{n} P(x_k \mid \mu)\, P(\mu)$ (1). The Gaussian is a reproducing density, so $P(\mu \mid D) \sim N(\mu_n, \sigma_n^2)$ (2). Identifying (1) and (2) yields $\mu_n = \frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\,\hat\mu_n + \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\,\mu_0$ and $\sigma_n^2 = \frac{\sigma_0^2\,\sigma^2}{n\sigma_0^2 + \sigma^2}$, where $\hat\mu_n$ is the sample mean.
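A sketch of these closed-form updates, assuming the noise variance $\sigma^2$ is known and the prior is $N(\mu_0, \sigma_0^2)$:

```python
import numpy as np

def posterior_over_mean(D, sigma2, mu0, sigma0_2):
    """Return (mu_n, sigma_n^2) of the Gaussian posterior P(mu | D)."""
    n = len(D)
    mu_hat = np.mean(D)   # sample mean of the observations
    mu_n = (n * sigma0_2 * mu_hat + sigma2 * mu0) / (n * sigma0_2 + sigma2)
    sigma_n2 = (sigma0_2 * sigma2) / (n * sigma0_2 + sigma2)
    return mu_n, sigma_n2

rng = np.random.default_rng(6)
D = rng.normal(loc=3.0, scale=1.0, size=50)   # synthetic observations
print(posterior_over_mean(D, sigma2=1.0, mu0=0.0, sigma0_2=4.0))
```

As $n$ grows, $\sigma_n^2$ shrinks toward zero and $\mu_n$ moves from the prior mean $\mu_0$ toward the sample mean.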

37 (Figure only.)

38 The univariate case, $P(x \mid D)$. $P(\mu \mid D)$ has been computed; $P(x \mid D)$ remains to be computed! $P(x \mid D) = \int P(x \mid \mu)\, P(\mu \mid D)\, d\mu$ is Gaussian, and it provides $P(x \mid D) \sim N(\mu_n, \sigma^2 + \sigma_n^2)$ (the desired class-conditional density $P(x \mid D_j, \omega_j)$). Therefore, using $P(x \mid D_j, \omega_j)$ together with $P(\omega_j)$ and Bayes' formula, we obtain the Bayesian classification rule: $\max_{\omega_j} P(\omega_j \mid x, D) \equiv \max_{\omega_j} \left[P(x \mid \omega_j, D_j)\, P(\omega_j)\right]$.
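Continuing in the same setting, the predictive density simply widens the noise variance by $\sigma_n^2$; a standalone sketch (the posterior values below are hypothetical, e.g. produced by the previous sketch):

```python
from scipy.stats import norm

sigma2 = 1.0                 # known noise variance of P(x | mu)
mu_n, sigma_n2 = 2.96, 0.02  # hypothetical posterior over mu

predictive = norm(loc=mu_n, scale=(sigma2 + sigma_n2) ** 0.5)   # P(x | D)

# Bayesian decision rule: pick the class maximizing P(x | w_j, D_j) * P(w_j).
x = 2.0
print(predictive.pdf(x) * 0.5)   # score for a class with prior P(w_j) = 0.5
```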

39 Bayesian Parameter Estimation: General Theory. The $P(x \mid D)$ computation can be applied to any situation in which the unknown density can be parametrized. The basic assumptions are: the form of $P(x \mid \theta)$ is assumed known, but the value of $\theta$ is not known exactly; our (pre-data) knowledge about $\theta$ is assumed to be contained in a known prior density $P(\theta)$; the rest of our knowledge about $\theta$ is contained in a set $D$ of $n$ random variables $x_1, x_2, \ldots, x_n$ that follow $P(x \mid \theta)$.

40 The basic problem is: compute the posterior density $P(\theta \mid D)$, then derive $P(x \mid D)$. Using Bayes' formula, we have $P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{\int P(D \mid \theta)\, P(\theta)\, d\theta}$, and by the independence assumption, $P(D \mid \theta) = \prod_{k=1}^{n} P(x_k \mid \theta)$.

41 Problems of Dimensionality. Consider problems involving 50 or 100 (binary-valued) features. Classification accuracy depends upon the dimensionality and the amount of training data. For the case of two classes, multivariate normal with the same covariance, $P(\text{error}) = \frac{1}{\sqrt{2\pi}}\int_{r/2}^{\infty} e^{-u^2/2}\, du$, where $r^2 = (\mu_1 - \mu_2)^t\, \Sigma^{-1}\, (\mu_1 - \mu_2)$, and $\lim_{r \to \infty} P(\text{error}) = 0$.

42 If the features are independent, then $\Sigma = \operatorname{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_d^2)$ and $r^2 = \sum_{i=1}^{d}\left(\frac{\mu_{i1} - \mu_{i2}}{\sigma_i}\right)^2$. The most useful features are the ones for which the difference between the means is large relative to the standard deviation. It has frequently been observed in practice that, beyond a certain point, the inclusion of additional features leads to worse rather than better performance: i.e., we have the wrong model.
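A sketch evaluating this error probability for independent features (the means and standard deviations below are made up); norm.sf computes exactly the upper-tail integral from the previous slide.

```python
import numpy as np
from scipy.stats import norm

mu1 = np.array([0.0, 0.0, 0.0])     # made-up class means
mu2 = np.array([1.0, 0.5, 0.2])
sigma = np.array([1.0, 1.0, 1.0])   # made-up per-feature standard deviations

r2 = (((mu1 - mu2) / sigma) ** 2).sum()   # r^2 for independent features
p_error = norm.sf(np.sqrt(r2) / 2)        # (1/sqrt(2 pi)) * integral_{r/2}^inf e^{-u^2/2} du
print(p_error)   # shrinks as more well-separated features are added
```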

43 (Figure only.)

44 (Figure only.)
